diff --git "a/data/yt_podcast_transcript.csv" "b/data/yt_podcast_transcript.csv" --- "a/data/yt_podcast_transcript.csv" +++ "b/data/yt_podcast_transcript.csv" @@ -1,6 +1,83 @@ -title,url,length,publish_date,transcript,total_words +title,url,duration,publish_date,transcript,total_words Sarah Catanzaro — Remembering the Lessons of the Last AI Renaissance,https://www.youtube.com/watch?v=v3O20NMdOuA,4584,2023-02-02,"Sarah: I think people see the output of models like DALL·E, GPT-3, et cetera, and they're amazed by what AI can do. And so the conversation doesn't even hinge on, ""We have access to this data set,"" or ""We have access to this talent pool."" It's more, ""AI is magical. What can we do with it? Come in and talk to us about this."" And again, I think that that is somewhat dangerous. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Sarah Catanzaro was a practicing data scientist and then went into venture. She's currently a General Partner at Amplify Partners, and one of the leading investors in AI and ML. Her investments include a whole bunch of companies I admire, like RunwayML, OctoML, Gantry, and others. It's really interesting to talk to an investor who's also technical. She has insights both on how the technology is built and how it's being adopted by the market at large. This is a really fun conversation and I hope you enjoy it. Sarah, thanks so much for doing this. I've been looking forward to this one. I had a bunch of questions prepped and then I was looking at your Twitter and I was like, ""Oh, there's like a whole bunch of stuff that we should..."" Sarah: Yeah. I feel like I've been doing a lot of thinking out loud recently. Including in response to a lot of the hype around Stable Diffusion, LLMs, et cetera. I appreciate the fact that both of us were there in the 2013, 2014 phase where every company was claiming to be an AI company. It feels like we're kind of heading down that road again, which scares me a little bit. I hope at least there are enough companies — people — who remember the lessons of the last AI renaissance. But we'll see. Lukas: Well, let's get right into it then, because from my perspective, I totally remember at least one other AI bubble. Maybe more, depending on how you count it. I guess from where I sit, it feels like this one might be different in the sense that I feel like these challenges that were always...seemed super, super hard, seem like they're really working. And I feel like I see applications happening unbelievably fast after the paper comes out. Actually even maybe before there's time to even publish any paper on the topic. I think I might be more bullish about large language models and Stable Diffusion than you, which is great because we can actually have an interesting conversation here. But I thought it's interesting. You've invested in Runway, and just the other day Cris was showing me a natural language input into Runway where you could basically type what you want, and it would sort of set up the video editing to work that way. I thought, ""Oh my gosh,"" this might be a totally new kind of interface that lots of software might quickly adopt, I guess. But it sounds like — looking at your Twitter — it sounds like you were playing with large language models and finding it super frustrating and broken. Tell me about that. 
Sarah: Yeah, so I think my concern is less about the capabilities of large language models specifically, and more about some of the lessons that we learned during the last AI renaissance. Which I think was roughly like 2014 to maybe 2017, around the time that AlphaGo came out. People were really excited about the capabilities of GANs and RL. At the time, I remember companies like Airbnb, Uber, Lyft building these big research teams, but not really having a clear agenda for those research teams, or understanding how the objectives of their research teams might align with the objectives of the broader organization. And then similarly, you saw all of these startup founders emerge that were talking about changing healthcare with GANs or changing finance with RL, but didn't really have insights into the nuances of those industries. My feeling of why ML didn't work the last time around — or rather, why ML adoption didn't occur at the pace that we anticipated — was that it was not really a technical problem, but rather a product, go-to-market problem. I am hoping that this time around, we've both learned from our mistakes but also — in the intervening time period — created enough enabling technologies, such that two things can occur. One is that companies can fail fast. Frankly, one of the things that scares me is that back then I remember a bunch of companies reaching out and basically saying things like, ""Hey, we've got a bunch of data. We'd love for you to come in and talk to us about our AI strategy,"" and thinking, ""I don't care if you have a bunch of data. Let's talk about a bunch of problems that you have, and how ML can solve those problems."" I've come to believe that you can't fight that urge. Founders will always be enticed by the promise of AI. But if they're able to experiment with it quickly, then I think they can start to learn more about the infrastructure, and data, and other investments that they may need to make in order for their AI initiatives to be successful. At the same time, I think by creating these higher-level interfaces that make ML more accessible to potentially the domain expert, it allows people with a more thorough understanding of business problems to at least prototype AI solutions. I'm somewhat skeptical that these very high-level interfaces will allow them to build production ML at scale, but at least they can see, ""Does it work? Do I need to now hire a data/ML team to realize this initiative further?"" Lukas: Do you have companies in mind that you like, that are creating these higher-level interfaces off of ML technology, that makes them usable for real world applications? Sarah: Yeah. I think Runway is actually a perfect example of the phenomena that I see playing out. Some people may not know, but Runway actually started off more as a model marketplace. Their goal had been to make GANs and other types of models accessible to creative professionals, but they weren't really focused on building out the video editing tools, at least initially. They created these higher-level interfaces, such that various creative professionals — whether it was artists, or directors, or photographers — could start to experiment with ML models. What they saw was that some of the most popular models were models that automated routine tasks associated with video editing. Based on that user behavior, they decided to double down on video editing. 
In fact, a lot of the model architectures that they've since created — including Stable Diffusion — were really purpose-built to support the workflows of video editors. I like that sort of workflow, where you use a prototype, or you use these higher-level interfaces to get insight into what users need — as well as potentially the limitations of the underlying technology — and then you iterate from there. Lukas: I totally remember a time, I think, of the era you're talking about — 2014 to 2017 — when every company was like, ""Oh, we have this data. it must be valuable because we can build a model on top of it."" Do you see some analogy today to that? What's the common request of an ML team that's misguided, or should be thinking more about problems? Because I feel like data maybe isn't seeming quite as valuable, in the world of LLMs and big models. Sarah: I think that what we're seeing today is arguably more nefarious than what we saw back then, because at least at that point in time, companies had invested in collecting data. They had thought about possibly what data to collect. And so there was some understanding of how to work with data. I think people see the output of models like DALL·E, GPT-3, et cetera, and they're amazed by what AI can do. And so the conversation doesn't even hinge on, ""We have access to this data set,"" or ""We have access to this talent pool,"" or ""We have this type of workflow that could benefit from these generative capabilities."" It's more, ""AI is magical. What can we do with it? Come in and talk to us about this."" And again, I think that that is somewhat dangerous. I was at a conference just last week. There was a presentation on ML infrastructure at a music company, and somebody in the audience asked, ""Does the AI listen to songs?"" It's a perfectly reasonable question. But I think it does kind of belie some of the misunderstanding of AI and how it works. Lukas: In what sense? Sarah: I think people think about AI as artificial agents. They think of AI as something that could listen to a song, not just something that could represent a song and make predictions based upon the content of that song. Again, I think better understanding of what LLMs are and what they can do will be really necessary to identify when they can be useful. Lukas: This might sound...this is a little bit of a soft ball — or might sound like a soft ball — but I was really genuinely interested in this. I feel like one of the things that you do really well, at least in my conversations with you, is maintain a pretty deep technical and current knowledge of what's going on in data stacks, basically. Or, data infrastructure and ML infrastructure. But yet you're not maintaining data infrastructure — as far as I know — so I'm kind of curious how you stay on top of a field that seems like it requires such hands-on engagement to understand it well. Or at least I feel like it does for me. Yeah, just curious what your process is. Sarah: Yeah. It's interesting because I'd say that, in some ways, that is one of my biggest concerns. I've been in venture now for about seven years, and so I can still say that I've spent most of my career in data. But it won't be long before that is no longer true. And certainly I have found that my practical, technical skills have gotten rustier. One comment on that is that I do think that losing my Python, SQL skills, etc. has actually enabled me to look at some of the tools and platforms that are available to users today, with a fresh set of eyes. 
I'm not as entrenched in the same patterns of behavior and workflows as I was when I was a practitioner. So it's been helpful to shed some of my biases. But I think what I've discovered is that you can understand how something works without using it. And therefore there are two things that are kind of critical to building technical understanding for me. One is just spending a lot of time with practitioners, and hearing about their experiences. How they're using various tools, how they're thinking about various sets of technologies. Frankly, just learning from them almost feels like a shortcut. Instead of trying to figure out what the difference is between automated prompting and prefix-tuning, just going to ask somebody and have a conversation with them. Which is kind of coincidental, and perhaps even ironic. Like, accelerate my learning by just learning from people with expertise in those areas. There's a lot that I just learned through conversation with practitioners. But I think going one level deeper — either reading white papers or reading research papers that give you kind of a high-level overview of an architecture, or how something works without getting into the nitty gritty of the underlying code or math — allows me to reason about these components at a practical level of abstraction. I can see how things fit together. I understand how they work. That doesn't necessarily mean that I'd be able to implement them. Definitely doesn't mean that I'd be able to iterate on them. But it's enough depth to reason about a component, and it's placed in a broader technical stack. Lukas: It's funny though, sometimes I feel like investors...I mean all investors do that to some extent, and I totally get why. But I think that I often feel also paranoid about losing my technical skills, because I feel like if all you can do is sort of figure out what box something belongs to, it's really hard for you to evaluate the things that don't fit into boxes. And I feel like almost all the interesting advances — actually, all the products that we want to come out with at Weights & Biases — generally is stuff where it doesn't fit neatly into one of those ML workflow diagrams that people make. Because if it was one of those boxes, then of course people are doing it, because it makes logical sense, but it's sort of when that stuff gets reshuffled...it does seem like you're able to maintain a much greater level of technical depth than the average investor, even in the data space. Which is why I wanted to have you on this podcast. I hope I'm not offending any of my current investors. Just a caveat there. You all are wonderful. I really do feel like you somehow maintained a much greater technical depth than most of your colleagues. Sarah: In many ways I'm amazed by my colleagues and what they do, because I think there are many investors that can reason about the growth of companies, and reason about sets of boxes and the relationships between those boxes without understanding what those boxes do. I don't think I could do that, but I've always also just been the type of person who needs to go a little bit deeper. As an example, I started my career in data science, but at Amplify I also invest in databases. And at some point — writing SQL queries, working with dataframes — I just wanted to better understand what was happening. When I write a SQL query and data shows up in my SQL workbench, what is happening on my computer? I think a lot of people take that stuff for granted. And they can. 
That is the beauty of abstractions. That is the beauty of technology. We are able to have this video conference — we are able to connect over the Internet — without understanding how the Internet works. My personality is such that I want to understand how the Internet works. I want to understand why I have service in some places and why I don't have service, and why my dataframe is slower than my SQL query. I do think that that makes me think about technical systems in different ways. Lukas: It’s funny, my co-founder Shawn is obsessed with — in technical interviews — assessing if someone understanding how a computer works, in his words. Which I think is really interesting, because I feel like I'm actually not... That's kind of a weakness of mine, I always wonder about a lot of the details there, but it is sort of an interesting perspective. I love working with all of my colleagues who have that same drive to understand how everything works. Okay, here's another question that I was wondering, I was thinking about. If I were to come to you, and I had a company in the data/ML space, and I had a bunch of customers that were really who we think of as tech-forward — like Airbnb, and Google, and that genre — would that be more impressive? Or would you be more thinking I'm likely to succeed if I came to you with a set of customers who we don't normally think of as tech-forward? Like an insurance company — a large insurance company — and a large pharma company. Which would you look at and say, ""Oh, that seems like that company is going to succeed""? Because part of me watches technology flow from the more tech-forward companies everywhere. But another part of me is like, ""Wow, these kind of less tech-forward companies have a whole set of different needs and often a different tech stack. And certainly there's more of them and they have more budget for this stuff."" So which would be the more impressive pitch for you? Sarah: Yeah, it's funny because I think in many ways the way that VCs make decisions — the way that we think about deals — is actually super similar to some of the patterns that we observe with neural networks. And that of course means that we have bias. It also means that we learn from patterns that we've observed. So, I can give you the honest answer, and then I can also give you the rational answer. The honest answer is that I would be more impressed by a company that has engaged with tech-forward customers. For the reasons that you described. In the past, we have generally seen that tech will spread from the Airbnbs and Ubers and FAANGs of the world into the enterprise, and not the other way around. We also have a bias that these more traditional enterprises tend to move slower. There tends to be a lot of bureaucratic red tape that you need to navigate. And as such, those markets tend to be less attractive. So, on its face, if you just said...you don't have any additional information about the velocity of sales, about the quality of the tech or team, etc. But like you're- Lukas: -holding them equal, I guess. Equivalent. Sarah: Yeah. That said, I think that is one of the biases that can cause us to make poor decisions. What really matters are some of the things that I just alluded to. If you're able to sell into insurance companies repeatedly — and with high velocity — that is arguably a better business than a company that spends 6 to 12 months trying to sell into tech companies. So it's less about ""To whom do you sell?"" and more about, ""Is that a big market? 
Are you able to sell efficiently? Are you able to sell scalably?"" I think sometimes we need to be aware of our biases and the impact that marquee logos can have on our decision-making. Lukas: Well, I can't tell if you think it's a rational bias or not. I mean, in some sense, you could call all pattern-matching biases. Do you really think it would be rational to sort of be less enamored with tech-forward customers than you actually are? Sarah: I think we need to ask ourselves and probe on, ""Under what circumstances might enterprises move quickly?"" A great example of this is a company called Afresh, which was one of the companies that did use RL to disrupt an industry. At that time, so many companies were trying to do the same thing, but didn't have as much insight into what was happening within an industry. They offer tech solutions — including things like inventory management and forecasting — to companies in the grocery space. Now, you might think that grocery is going to be a super outdated, slow-moving industry. And therefore that selling into grocery chains would be long and tedious. And perhaps not very scalable. But, at the time, a lot of grocery stores were responding to — and/or otherwise just terrified by — the acquisition of Whole Foods by Amazon. This was then [followed] by the pandemic, which certainly put a lot of stress on their online and multi-channel delivery and e-commerce capabilities. So there were these exogenous shocks which made what might have been slow-moving market participants move a lot faster. Those are the phenomena that we're sometimes blind to, because we just hear ""grocery"" or ""healthcare"" or ""manufacturing"" and think ""slow"", rather than thinking, ""What would it take for the participants in that sector to move fast?"" Lukas: That makes sense. Here's another point that you made on Twitter, that I was contemplating. I actually don't think I have a strong point of view on this, although I really should — given the company that I'm running — but you mentioned a lot of VCs have been saying that you expect the point solution MLOps space to consolidate. One thing that's interesting about that, is that I think you've invested in some MLOps tools. Do you sort of expect them to expand in scope and eat the other companies? Is that something that you need to bet on when you invest in them? Or would you be happy to see them get bought by other tools? How do you think about investment then, in MLOps tools companies, with that worldview? That's my practical question. And then the other thing that I observe, is that it doesn't necessarily seem like developer tools in general are consolidating. So I think I might even agree with you, but I wonder how you sort of pattern match that against developer tools. Or even maybe the data stack... I don't know. Do you think that the data stack is also consolidating? Or what's going on there? Sorry, I just dumped a whole bunch of different questions on you, but... Sarah: Those are great questions. So, I do think that in general most technical tools and platforms will go through phases of consolidation and decoupling. Or, as people love to say today, bundling and unbundling. I think it's just the nature of point solutions versus end-to-end platforms. You have a bunch of point solutions, they're difficult to maintain, they may be challenging to integrate. You then kind of bias towards end-to-end platforms, you adopt an end-to-end platform. 
It doesn't address a certain edge case or use case that you're experiencing, so you buy a new tool for that edge case, and unbundling happens. I think the pendulum will always swing back and forth between bundling and unbundling, or coupling and decoupling, for that reason. To be clear, as a former buyer, I don't think that point solutions or end-to-end platforms are the best solutions for a company. I think there's space in the middle, where you have a product that can solve a few adjacent problems. That's typically what I look for when I invest. I want to make sure that the company in which I'm investing is solving an urgent — and often point — problem. They're solving an urgent and specific problem. However, I typically also want to see that the founder has a hypothesis about how they would expand into adjacent problem areas. It's not that I think solving point problems is bad, but I do think given the pendulum of coupling and decoupling, having some hypotheses about the areas that you can expand into becomes critical. It's interesting to consider why this may or may not happen in the world of developer tools. I'd argue that you still see consolidation. However, the consolidation tends to happen across layers of the stack, versus across the workflow. Lukas: Interesting. What are you...tell me...what are you thinking of there? Sarah: Things like serverless, where you're no longer reasoning about resources and config. That might not be impacting other parts of your developer workflow. That might not be eating into your git-based development workflows, or your testing processes, and things like that. But it is eating into how you think about managing VMs or containers. It is possibly eating into how you think about working with cloud vendors, and deciding upon underlying hardware, and things like that. So it might be the case that, in software development, we've seen companies — or we've seen vendors — solve specific problems, but solve those all the way down the stack. I haven't really thought about that as deeply. But I think it's a worthwhile question to ask. I would say that one of the big differences, though, that I see — and that we of course need to be mindful of — is that there are far more developers than there are data practitioners. And so, when you're trying to answer the question, ""How does this thing get big?"", those building developer tools can arguably solve a specific problem for a larger number of people than those building for data teams. When you're trying to answer that question of, ""How does this get big?"", you could potentially get stumped just by the number of people for whom a tool is actually applicable. Lukas: Is that what gives the intuition that we're in a moment of bundling? That there's just all these point solutions that you feel kind of can't survive on their own, just given the size of the market that they're in? Sarah: I think it's a combination of things. On one hand, I see a lot of...the slivers are getting tinier. You start to see things like ""model deployment solutions for computer vision,"" and perhaps some subset of computer vision architectures. Where, you might think to yourself, ""Okay, I understand why the existing tools are maybe not optimal for that specific use case, but that's really narrow."" To my point about thinking about these orthogonal problems, it's unclear how you go from that to something meatier. That's one phenomenon that I observed. 
I think the other is just that practitioners are really, really struggling to stitch things together. The way a friend put it to me about a year ago, he basically said he feels like vendors are handing him a steering wheel, and an engine, and a dashboard, and a chassis, and saying ""Build a fast, safe car."" Those components might not even fit together, and there's no instruction manual. It's easy to cast shade on the startups that are building these tools and platforms, but I think one of the things that is more challenging in the ML and AI space than even like data and analytics, is that a lot of the ML engineering and ML development workflows are really heterogeneous now. If you're a vendor and you're trying to think about, ""With whom should I partner? With whom should I integrate? Do I spend time on supporting this integration?"", it's tougher to make those decisions when practices and workflows are so fragmented and heterogeneous. I do think that creating more of a cohesive ecosystem has been difficult not because vendors are dumb, but because there's just a lot going on. Lukas: Well, I think the other challenge maybe is that when there's so many different technologies that people want to integrate into what they're doing — because there's so much exciting research and things that come along, based on different frameworks and so on — it's hard to imagine an end-to-end system that would actually be able to absorb every possible model architecture immediately, as fast as companies want to actually use it. Sarah: Yeah, yeah 100%. I have been thinking about this in the context of LLMs. We don't yet know how the consumers or users of pre-trained models are going to interact with those who create the pre-trained models. Will they be doing their own fine-tuning? Will they be doing their own prompt engineering? Will they just be interacting with the LLM via API? Without insight into those interaction models, it's really hard to think about building the right set of tools. It's also unclear to me that the adoption of LLMs would actually imply that we need a new set of tools, both for model development and deployment, and management in production. I have a lot of empathy for people who are building ML tools and platforms because it's a constantly moving target. Yet, there's the expectation that you're able to support heterogeneity in all regards. In all regards, whether it's the model architecture, or the data type, or the hardware backend, or the team structure, or the user skill sets. There's so much that is different from org to org. I think building great tools is really challenging right now. Lukas: I guess that's a good segue to a question I was going to ask you. When you look at LLMs, do you have an intuition on if a new set of tools are needed to make these functional? Sarah: I think one of the bigger questions that I have is, again, on how the consumers of LLMs — or how the users of LLMs — will actually interact with those LLMs. And more specifically, who will own fine-tuning. I imagine that there are certain challenges that will need to be addressed, both with regards to how we collaborate on the development of the LLMs, but also how we think about the impact of iterations on LLMs. If OpenAI wants to retrain one of their models — or otherwise tweak the architecture — how do they evaluate the impact of that change on all of the people who are interfacing with the GPT-3 API, or with any of their other products? 
I think a lot of the tools that were built for model development and deployment today kind of assumed that the people who were developing models would be the same set of people — or at least within the same corporate umbrella — as those who are deploying and managing models in production. And if LLMs drive a shift — wherein those who are developing models and those who are deploying and building applications around models are two completely separate parties — then some of the tools that we have today might be ill-suited for that context. Lukas: Do you think we're headed towards a world like that, where there's a small number of companies generating foundational models? And then mostly what other companies are doing is fine-tuning them or doing some kind of prompt engineering to get good results out of them? Sarah: Here we're getting a little bit into the technical nitty gritty, but my impression from tracking the research community so far has been not all...though LLMs are great for what we typically think of as unstructured data — primarily images, text, video, et cetera, audio too — they have not outperformed gradient boosting or more traditional methods on structured data sets, including tabular and time series data. Although there's some work on time series that I think is pretty compelling. This is one of those areas where I feel like the research community just completely underestimates how many businesses operate on structured data. While it's possible that adoption of LLMs will drive this new interaction model or new market model — wherein some companies built these large foundation models and others interact with those — I don't see gradient boosting or more classical approaches going anywhere. Because I don't see structured data going anywhere. Arguably, structured data powers many of the most critical use cases within organizations, ranging from search and recommendation engines to fraud detection. I think it would be a tragedy to neglect the needs of those who are using...I don't want to say simpler approaches, but certainly simpler approaches and more complex approaches, by using architectures that are not perhaps attention-based, when working with these specific data sets. Lukas: Interesting. Do you have an opinion on...how to say this? I feel like many investors especially, but I think many smart people looking at the space of ML and data, they think, ""Wow, this is gonna commoditize. This is going to get...tools are gonna make this easier. Less companies are going to want to do this internally and spend money on expensive resources."" But I guess when I look at what companies actually do, it seems like they spend more and more, and even kind of push up the salaries. And they have this fight for scarce, specific talent. Which way do you sort of predict things are going? Do you think like 10 years down the road, ML salaries go up or do they go down? Maybe it's a more concrete way of putting it. Sarah: Yeah, that's a great question. I probably expect that the variance would increase. My guess is that there are certain applications that may be commoditized — or at least that may be commoditized for some subset of the market — while others continue to be pursued in-house. Search is perhaps a very interesting example. For some businesses, they may be more than happy to rely upon a vendor to provide those semantic or vector-based search capabilities. 
While search may have an impact on their bottom line, perhaps it's not the most critical or most impactful thing to their business, but rather just a capability that they have. This is not to say that Slack actually uses a vendor or should use a vendor, but as far as I can tell, Slack doesn't really monetize on search. You'd contrast that, however, with an e-commerce business or something like Google, where their ability to deliver the highest quality search results and their ability to improve search — just marginally — could be a huge impact on revenue. Those companies are probably likely to develop their own models. I think we'll see that some companies do their own model development. Some use cases are not commoditized, and those companies for those use cases you see very high ML salaries. But then, perhaps for others, you're really just a software engineer who knows a little bit about ML, and can interface with some of these models through APIs, and can reason about the output of experiments and behavior that you might see in production. Lukas: I guess in that vein — and you sort of alluded to this earlier a little bit — what do you think about all these sort of low-code and no-code interfaces into exploring data, building ML models? You mentioned earlier that you think that's generally a really exciting trend. Sarah: My opinions on this category are pretty nuanced, so I was thinking about where to start. Generally speaking, I'm very skeptical of no-code, low-code solutions. I find that many of these tools — no matter what the sector or what the use case — they end up shifting the burden of work. Not necessarily removing that burden, or even lightening that burden. A great example is self-service analytics. My own belief is that in general, most self-service analytics tools don't actually reduce the burden that the data team or analytics team bears, but rather shifts the work of the data team from building analytics products to debugging, explaining, or fixing analytics products. And I think the same can be true in the ML space. Why I'm excited about some of these tools in the ML space is that I actually think that in ML, failing fast is really critical. Some of these tools that enable users to prototype ML-driven solutions might help them better understand, ""Is this going to work? What additional investments do I need? What do my users expect from the system before they make a decision to invest further?"" It enables that kind of quick prototyping, learning, and failing fast. The other thing that I feel quite strongly about, is that we need to explore ways to decouple model development and ML-driven app development. Whenever I talk to companies about their ML architectures or their ML stack, it becomes so obvious that ML is just this one tiny component in a much larger app architecture. The prediction service might be connecting with other databases, or stream processing systems, or other microservices, tools for authorization, and so on and so forth. I think it's really important to be able to build applications around a prediction service while independently iterating on the model that powers that prediction service. So, I am somewhat long on tools that enable engineers to prototype ML-driven systems, so that they can build those application architectures. 
And then, once they have a better understanding of the full system requirements — including some of the latency associated with things like moving data around — they can kind of pass off a fuller spec to a data scientist who will iterate on the model and model architecture, armed with the knowledge that these are the attributes that we need in order to make this project successful. Lukas: That makes sense. Okay, another question. When you invest in a company that is providing some kind of ML or data service, does it cross your mind, ""What if AWS does that?"" Or GCP or Azure. Is that an important thing to consider, do you think, or is that irrelevant? Sarah: Yeah, yeah. I smile because I feel like this question, it comes up somewhere between like one to five times a week. Given the areas that Amplify invests in — we're primarily focused on data, ML tools and platforms, enterprise infrastructure, and developer tools — we're constantly fielding this question of, ""What if AWS or GCP or Azure does this? Won't that company — won't that market, et cetera — get crushed?"" In the past, what I've told people is that I have found that startups tend to be better at building developer experiences. Anecdotally, this is just something that we observe. People complain a lot about the experience of using AWS tools, the experience of using things like SageMaker. I've thought a little bit more about why that's the case. I think, generally speaking, the cloud vendors need to develop for their most spendy customers, their highest-paying customers. And their highest-paying customers tend to be enterprises, shockingly. As such, they're developing for an enterprise user who probably has fairly strict privacy/security requirements, who may have a very distinct way of organizing their teams, who may be bringing in a persona with a specific skill set into data science or ML roles. If I had to present a hypothesis about why they haven't been able to compete on developer experiences, I think it's because often they are creating tools and platforms for a developer who is not as representative of the rest of the market. But, to be honest, with the passage of time, I've just seen enough examples of companies that have been able to out-compete the cloud vendors where I just don't worry about it that much anymore. Lukas: Have you ever seen anyone get crushed? Sarah: Crushed? Lukas: Has that happened in your career? Sarah: No. I mean, I'm sure it has. But it's hard for me to think of an example, whereas it's easy to think of many, many examples of companies that were not crushed by the cloud vendors. If anything, I think sometimes we see that start-ups get...they sell too soon. The way in which the cloud vendors out-compete them is putting some juicy acquisition offer in front of them and then they don't have to compete. That's the only example that I could see or think of, off the top of my head, of the cloud vendors crushing a potential competitor. They crush it with their dollars. Suffocate companies with their acquisition offers. Lukas: R&D through M&A, yeah. I saw an interview or a conversation that you had with Andrew Ng. I thought you had an interesting point that academic benchmarks...they often don't really reflect industry use cases. But you were kind of pointing out that industry has some share of the blame for this. Can you say more on that topic? Sarah: Oh, absolutely. I am really grateful to Andrew for actually drawing my attention to this issue. 
We often think about the gap between research and industry, but we don't as often think about the gap between industry and research. Andrew and I had been talking about this challenge of structured data versus unstructured data. I think I said to him, ""What I see in industry is that most ML teams are working with tabular and time series data. What I see in the research community is that most researchers are building new model architectures for unstructured data."" There's a big mismatch between what model architectures people in industry need — given the data that is available to them, as well as given the types of problems that they're trying to solve — and the research that's becoming available. Now he pointed out to me — and this is something that I hadn't really thought about before — researchers have access to unstructured data. They have access to things like ImageNet. They don't have access to high volumes of data on user sessions, or logs, metrics, and events. The data sets that tend to be the lifeblood of most companies. It is very difficult to innovate on AI techniques for data sets to which you have zero access. I think it's easy to point to that research and be like, ""Oh, there's such a big gap between what they're building and what we need."" I think we also need to be mindful of what the research community can do, given the resources that they have available to them. I've seen a couple of efforts by a few organizations to open source their data sets, but it's tough because oftentimes the most valuable data sets are the most sensitive ones. What company wants to share their click-through data that probably reveals the state of their business, some of the experiments that they're running, and so on so forth. Lukas: Well, there's also not a lot of upside. I remember the Netflix contest was such a popular, awesome thing. Got so many people involved, so much attention to research to Netflix — still a seminal data set — but they didn't do a second one because they felt like...there are user privacy issues, that they couldn't get around to release it. I don't know if you remember when AOL released a subset of their query logs. It was so exciting to actually have that. I was in research at the time and I was like, ""This data set is like gold."" And then like the next day, they fired the person that released it. And their boss — I think their boss' boss, right? — because there was some personal identifying information in that. It's hard to see a lot of upside for corporations, even if they were sort of neutral on the impact of...on the company secrets, IP issue. Sarah: Yeah. One of the things that I have seen — that has been very encouraging — is more and more interview studies or meta analyses coming out of the research community. Where it's clear that the researchers are interested in better understanding the problems that practitioners face in industry. One critique that I've had of those studies in the past, is that the authors tend to interview people to whom they have immediate access, which means that they often interview practitioners at some of their funding organizations. The organizations that are sponsoring their labs, which means that they tend to bias more towards larger enterprises or big FAANG companies. They're interviewing people at Facebook, Apple, Tesla on their data and ML tools, platforms, practices, and then drawing conclusions about all of industry. 
But I think that recently I've seen a couple of studies come out where there's been a more focused effort to get a more random — or at least more diverse — sample of practitioners from both smaller startups, more traditional companies, bigger tech companies, et cetera, to really better understand both the similarities and differences between how they approach model development and deployment. I hope that continues. Lukas: Do you have a study that's top of mind, that you could point us to? Sarah: So, Shreya Shankar, who had actually been a university associate. Lukas: Yeah, I saw that. Totally. Nice. Sarah: I was really thrilled because Shreya actually reached out to us and said, ""Hey, can you connect us to people at different types of companies? I've got connections to people at Instagram, Facebook, Apple, et cetera et cetera, but I want to talk to people at mid-market companies, or early-stage startups, and B2B companies, and better understand some of the nuances of their workflows."" Lukas: What was the name of the paper? I think I just saw it. Sarah: ""Operationalizing Machine Learning: An Interview Study"". Lukas: Thank you. Yeah, I agree. That was an excellent paper. Sarah: Yeah, yeah. The other thing that I had said...I sent Shreya a text message after reading through it. The other thing that I really appreciated about the interview study was that she didn't cherry-pick the insights that were most likely to drive interesting research questions or solutions. I think she took a really genuine and unbiased approach to thinking about, ""What are the problems that people are talking about? What are the ways in which they're solving them? Let's highlight that there are a bunch of problems that people are just solving in practical — albeit hacky — ways, but ways that they're content with."" I thought it was a very honest study. Lukas: Totally. I totally agree. Well, I guess if we are possibly headed towards another bubble in machine learning — or machine intelligence, as you sometimes call it — do you have any advice for a startup founder like me? Or maybe an ML practitioner, which is most of our audience. Having gone through another bubble, how would you think about it? What would you do if you started to...I think we're already seeing bubble-esque behavior. What are the lessons? Sarah: I think the most critical lesson that I saw/learned the last time around was, ""Focus on your users,"" or ""Focus on the strategic problems that you're trying to solve."" And ""Really, really understand if and why ML is the best tool to solve that problem."" I think it's critical to think about machine learning as a very important tool in our toolkit. But one of several tools. I was catching up with a friend a couple of weeks ago, and she had mentioned to me that the way in which she prioritizes ML projects is through regular conversations with their product leadership, and engineering leadership — and her representing ML leadership — about the product roadmap, about the user behaviors that they're trying to unlock. And then thinking about whether ML or traditional software development approaches are a better tool for achieving those things. I think as long as we continue to think about ML as a tool to solve problems — and as long as we have the tools that enable us to better understand if ML is solving those problems, and how to improve upon its ability to solve those problems — then ML can be a super powerful tool. And one that we learn to wield in more powerful ways too. 
But — I feel almost like a broken record saying this, given the lessons learned in the past — if we treat ML like a silver bullet, if we treat it like a hammer looking for a nail...that was the pattern that I think led to failure. Don't think about ""What ML can do for you"", think about ""What you can do for your country,"" and if ML is the right way to do that, I guess. That's the lesson that we learned and I hope it's the lesson that we will carry forth. Lukas: Love it. We always end with two open-ended questions. The first of the two is, if you had extra time, what's something that you'd like to spend more time researching? Or, put another way, what's an underrated topic in data or machine learning? Sarah: Oh man, that one is very easy for me: programming languages. I would love to spend more time learning about programming languages. I am definitely not convinced that Python is the right interface for data science, or that SQL is the right interface for analytics work. I would really love to learn more about programming language design, so that I could better diagnose if and why Python and SQL are the wrong tools, and how one might go about building a better PL interface for data scientists, ML engineers, and analysts. Lukas: Okay, a question that I didn't ask — because I thought it was a little weird or maybe nosy — is why you were asking on Twitter if anyone knew any female Rust developers. Because I will say Rust comes up just a shocking amount on this podcast, and I was wondering what's driving the interest in Rust, and then if there was some reason behind looking for a female Rust developer, and if you actually found one. Sarah: Yeah, yeah. So, full transparency — and I think I maybe put some of this on Twitter too — quick background is that certainly earlier in my career, I felt like oftentimes I wasn't getting invited to the same set of events, et cetera, as some of my male peers, and therefore I wasn't getting exposure to the same set of conversations — maybe even the same opportunities — to potentially see deals, and things like that. I feel pretty strongly that we need to have women in the room when we host events, to ensure that they're getting exposed to the same set of opportunities. That we're not doing things to hamper their progress in the industries in which they operate. We were hosting a Rust developer dinner, and looked at the guest list, and there weren't that many women, and it felt like we could do better. Thus the origins of my question. Lukas: I see. Sarah: Why Rust? See, I wish I spent more time studying programming languages, so I could better understand why people are shifting from C++ to Rust. Luca Palmieri — who I believe is now at AWS, actually — has a great blog post on why Rust might be a more appropriate backend for Python libraries that often have C++ backends. Things like pandas, where we experience it as Python but in fact it has a C++ backend. I've heard that Rust is more accessible than C++ and therefore could perhaps invite more data practitioners to actually contribute to some of those projects. But I don't know enough to really say why Rust is so magical, other than a lot of smart people — apparently, like Linus Torvalds, too — believe it is. If it's good enough for him, it's good enough for us. I don't know. Lukas: Fair enough. My final question for you is, when you look at the ML workflow today going from research into deployment into production, where do you see the biggest bottlenecks? 
Or maybe where do you see the most surprising bottlenecks for your portfolio companies? Sarah: I generally think that...there are two bottlenecks that I would call attention to. Actually three, sorry, I'm being kind of indecisive here. One pattern that I've observed with ML is that we often iterate on ML-driven applications — or ML-driven features — more frequently than we iterate on more traditional software features. To give an example, we may iterate on a pricing algorithm far more frequently than we would iterate on a navigation panel, or an onboarding flow, or something like that. Earlier I was talking about understanding how ML can solve user and company problems. I don't really think we have enough insight into the way in which model performance correlates with behavioral data — or the product engagement — to iterate super effectively on models. I think that has been a limitation, and one that could have nefarious effects in the future. Another big challenge that I see — and I alluded to this before — is the challenge of building software applications around a prediction service, or around a model. In the past, people might have talked about this as a model deployment problem. The problem isn't containerizing your model and implementing a prediction service in production. I think that has gotten significantly easier. The problem is connecting to five different databases, each which have different sets of ACID guarantees, latency profiles...also connecting to a UI service, potentially connecting to other application services. The problem is the software development. What you've got is a trained model, but now you actually have to build a software application. I don't think we have great tools to facilitate that process, either for ML engineers or for software engineers. And then around the same space, I also think that the transition from research to production — and back — can still be challenging. Perhaps what a company wants to do — upon seeing an issue associated with the model in production — is actually see the experiment runs associated with that model, so that they might get more insight into what is now happening in that production environment. That shouldn't be difficult to do. But, in the past I think we really developed tools either for model development or for MLOps, and we're starting to see some of the pain points that arise when those sets of tools are not coupled together. Lukas: Cool. Yeah, that all definitely resonates with me. Sarah: Lest I sound too cynical, I am really optimistic about the future of ML. I think we just need to do it in a sane and rational way and be mindful of what we're trying to accomplish here, instead of just focusing on flashy press releases and cool demos. Lukas: I was thinking as you were talking about the hype cycle, and large language models, and stuff. I was thinking VCs probably feel the hype cycle the fastest. I'm like, ""Man, we've basically solved the Turing test and, like, no one cares. My parents are like, ""What even is this,"" you know. It's like, ""Come on, this is awesome, look at it."" But I think every investor knows about Stable Diffusion but I don't think...I even come across Chief Data Officers at Fortune 500 companies who are like, ""What's Stable Diffusion?"" It's like, ""Come on, you should know about this."" Anyway... Sarah: Yeah, yeah. But I think there's this awareness, though, of ""This is where the hard work starts."" Lukas: Yeah, totally. 
Sarah: ""Great, we're able to generate beautiful artistic renderings based on textual prompts. Okay, how do we generate photos that are equivalent to that which a professional photographer would produce?"" Because that's what it's going to take to get a Getty Images or Flickr to adopt something like Stable Diffusion. How do we make automated rotoscoping so good that a video editor doesn't need to correct the mask at all? Because that's what it's going to take for Runway to compete with some of the more traditional video editors. I saw, through Runway, that the research is not good enough. They've had to do a lot of engineering, as well as their own research, in order to operationalize some of these things. I am so optimistic about the potential of the technologies, but I also am realistic that reining them in, and actually leveraging these technologies to do good in the world — or to build great products — is hard. Short anecdote, but I've been talking to a founder who was working on brain-computer interfaces and actually developed this technology where, effectively, it's able to read minds. You had to put on some big helmet thing, but once the helmet was on, it could kind of transcribe thoughts. And they were able to get it to work. Now, the founder subsequently shifted focus to the gaming space, doing more work with haptic interfaces. I was asking him like, ""Why didn't you pursue the mind reading tech further?"" And he said to me, ""We couldn't find any great use cases."" Isn't that crazy? But I think, this is tech. Sometimes you can do absolutely remarkable things with technology. But it doesn't matter. It doesn't matter unless you figure out how to appeal to people, and get them to use it, and how to align that technology with an important set of problems. I think that is the thing — as VCs — we need to continue to remind ourselves. Tech is not easy. Tech is not easy, but people are not easy either. Both are really hard. Unlocking new sets of technologies often means that we are granted the opportunity to solve really hard human problems. I guess...TL;DR if GPT-3 starts reading minds. Maybe we'll be able to find some applications for it. But, we'll see. Lukas: Thanks so much, Sarah. That was super fun. Sarah: Yeah, for sure. Bye! Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So, check it out.",9519 Cristóbal Valenzuela — The Next Generation of Content Creation and AI,https://www.youtube.com/watch?v=wbonGgk-_Gk,2426,2023-01-19,"Cris: I think a big mistake of research — specifically in the area of computer creativity — is this idea that you're going to automate it entirely. You see one-click off solutions to do X, Y, or Z. I think that's missing the bigger picture of how most creative workflows actually work. That probably means that you've never actually worked with an agency where the client was asking you to change things every single hour, or make it bigger, make it smaller, right? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Cris Valenzuela is an artist, and technologist, and entrepreneur, and CEO and founder of a company called Runway, which is a maker of ML-powered video editing software. 
But I feel that description doesn't even do justice to how incredible and innovative his product is. This interview actually starts off with a live demo of his product. I really recommend switching to video if you're listening to this on audio only, because his demo is absolutely incredible. Well, all right, Cris, we don't normally do this, but I thought it would be fun to start with a product demo if you're down for it. You have such a cool, compelling product. Would you be up for that? Cris: Sure. What do you want me to demo? There's a lot I can do. I want to make sure I can focus on what you want to see. Lukas: Well, this is an ML podcast. So I think people would probably be interested in the most flashy ML features. How about that? Cris: In short, Runway is a full video creation suite. It allows you to do things that you might be able to do in more traditional video editing software. The main difference is that everything that runs behind the scenes...so, most of the core components of Runway are ML-driven. The reason for that is that making everything ML-based brings two main things. One is, it helps editors, and content creators, and video makers automate and simplify really time-consuming and expensive processes when making video or content. There's a lot of stuff that you're doing in traditional software that is very repetitive in nature, that is very time-consuming or expensive. Runway aims basically to simplify and reduce the time of doing this stuff. If you have a video you want to edit, an idea you want to execute, spending the time, and the minutes, and the hours, and sometimes days on this very boring stuff is not the thing that you really want to do. So we build algorithms and systems that help you just do that in a very easy way. And then there's another aspect of Runway: it's not only about automation, but it's about generation. We build models, and algorithms, and systems that allow our users and customers to create content on demand. And everything...the baseline for us is that everything happens on the browser. It's web-based and cloud native, which means that you don't rely anymore on native computers, or native applications, or desktop compute. You have access to our GPU cluster on-demand, and you can render videos in 4K or 6K pretty much in real time. Plus you can do all of this AI stuff in real time as well. A lot of the folks are using Runway now — CBS, The Late Night Show with Colbert, or the folks who edit Top Gear, or sometimes creators who do stuff for Alicia Keys or for just TikTok or movies — they're all leveraging these AI things via this web-based, cloud-based editor. So that's a short, five-minute intro to what the product does and how ML or AI plays a role in the product itself. But I'm happy to now show you how everything goes together and the experience of using the editor, if that makes sense. Lukas: Please, yeah. Cris: Cool. Any questions before we do that? I can double down, or clarify, if you want me to. Lukas: Well, I actually didn't realize that professional video teams like The Colbert Show use Runway. Do they use it for all of their video processing or is there a certain part where they use it? How does that work? Cris: It depends. Some editors and some folks are using it as an end-to-end tool to create videos. Some other folks use a combination of different software to make something. The folks who use it for movies sometimes add in Nuke or Flame. 
We have a big Flame community, so Runway becomes a part of that workflow. It's replacing something you do on a very manual basis. It's sometimes replacing a contractor you hired to make that work for you, or it's sometimes replacing your own work of trying to do it yourself in this old software. But you still use other aspects of it, or other software to combine [with] it. It really depends on the type of content you have and the level of outcomes that you need. But we do have folks that use it as an end-to-end content creation and editing tool. Lukas: Cool. Well, I mean the extent of my video editing is basically modifying videos of my daughter to take out the boring parts and send them to my parents. That's as far as I go. Maybe you could sort of give me a little bit of an overview of the cool stuff you can do with Runway. Cris: Totally. You can do all of that in Runway on the browser which is...you might be...you might start using Runway for that. The one thing I would emphasize is, everything is running on the cloud, on the web. You can just open any project with a URL. You can also create teams, and you have this baseline collaboration aspect that just runs out-of-the-box. Cool. Anything else? No, just go demo? Lukas: Yeah, let's see a demo. Totally, yeah. Show me the cool stuff. Cris: Perfect. So, this is what Runway looks like. If you've ever edited video before, it's a very common interface. We have tracks on the bottom. We have a multi-editing system with audio tracks, and keyframe animations, and text layers, and image support. You can preview your assets on the main window and have a bunch of effects and filters on the right. Again, everything running pretty much on the cloud in real time. The idea here is that there are a lot of things that you can do that are very similar to stuff that you can do in other applications, plus there are things that you can't do anywhere else. Let me give you an example of something that a lot of folks are using Runway for. I'm going to start with a fresh composition here. I'm going to click one of the demo assets that I have here. I'm going to click this. I have a surfer, right? On that shot, let's say I want to apply some sort of effect or transformation to the background of this shot. Or I want to maybe replace the person here and take it somewhere else. The way we do that today would be a combination of frame-by-frame editing, where you're basically segmenting and creating an outline of your subject, and every single frame you move, you have to do it one more time. For that, we built our video object segmentation model — which we actually published a blog post and a paper around — that allows you to do real-time video segmentation. In film, this is actually called rotoscoping. You can just literally go here, guide the model with some sort of input reference. I tell the model this is what I want to rotoscope, and it can go as deep as I need. I can select the whole surfer layer here, or go deeper for more control over it. Once the model has a good understanding of what you want to do, it will propagate that single keyframe or single layer to all the frames of video in real time. You get a pretty smooth, consistent segmentation mask that you can either export as a single layer, or export as a PNG layer, or you can use...go back to your editing timeline and start modifying. Say you want to cut it, you want to compose it, you want to do some sort of transformation...you can do that directly from here.
Let's say I have my baseline — or my base video — here, I have my mask on top of that, and now I can just literally move it around like this. I have two layers, right, with a surfer. So, something that looks very simple, and in traditional software may take you a couple of hours of work, here you can do pretty much in real time. Again, it's something that most editors know how to do, but it just takes them a lot of time to actually do. Lukas: And did you just run that in the browser? Cris: Yeah. Lukas: That segmentation mask, it figured out in the browser and it's calculating all...it doesn't go to the server? Cris: No, it goes to the server. Yeah, there's an inference pipeline that we built that processes real-time videos and allows you to do those things. The compute part is everything running on the cloud. You just see the previews and sometimes — depending on your connection — you can see a downsampled version of it, so it runs really smoothly and plays really nicely. Also, for every single video there are a few layers that we run, that help guide something like a segmentation mask. For instance, we get depth maps and we estimate depth maps for every single video layer. You can also export these depth maps as independent layers and use them for specific workflows. That's also something very useful for folks to leverage. So you have this and you can export this. Behind the scenes, we're using this for a bunch of things. Lukas: Cool. Cris: That's one of the things that you can do. You can go very complex on stuff. Let's say, instead of the surfer, I just want the — let me refresh this — I just want the background. I don't want the surfer. I can inpaint or remove that surfer from the shot. So I'm just gonna paint over it. Again, I'm giving the model one single keyframe layer, and the model is able to propagate that consistently for the entirety of the video. That's also something we — as a product philosophy — really want to think about. Which is, you need to have some layer of control of input. The hard part of that should just be handled by the model itself, but there's always some level of human-in-the-loop process, where you're guiding the model. You're telling it, ""Hey, this is what I want to remove. Just go ahead and do the hard work of actually doing that for the whole video sequence."" Lukas: Wow, that's really amazing. That's like magic, right there. The surfer’s really just gone. Cris: Yeah. That's something we see a lot, when people find out about it, or when they start using it. ""Magic"" is a word we hear a lot. It's something that...again, if you're editing or you've worked in film or content before, you know how hard, and time-consuming, just painful it is. Just seeing it work so instantaneously really triggers that idea of magic in everyone's minds. Which is something for...that's great, because we've really thought of the product as something very magical to use. So, there's stuff like that. There are a few things like green screen and inpainting — which I'm showing you now — plus motion tracking, that we consider as baseline models in Runway. Those are just...you can use them as unique tools, as I'm showing you right now. You can also combine them to create all sorts of interesting workflows and dynamics.
There's the idea of, ""You want to transform or generate this video, and take this surfer into another location."" You can actually generate the background, and have the camera track the position of the object in real time, and then apply the background that you just generated in a consistent manner, so everything looks really smooth. The way you do that is by combining all of these models in real time, behind the scenes. You might have seen some of those demos on Twitter, which we've been announcing and releasing. This is a demo of running a few of those underlying models, combined. There's a segmentation model that's rotoscoping the tennis player in real time. There's a motion-tracking model that's tracking the camera movement, and then there's an image-generation model behind the scenes that is generating the image in real time. Those are all composed at the same time. Does that make sense? Lukas: Yeah, yeah. Totally. Cris: Those are, I would say, underlying baseline models and then you can combine them in all sorts of interesting and different ways. Lukas: Totally. Alright, well, thanks for the demo. That was so cool. We'll switch to the interview format. Although now I really want to modify this video in all kinds of crazy ways. Cris: We should replace the background with some stuff while we're talking. Lukas: Totally. Get this microphone out. One question I really wanted to ask you is, I think your background is actually not in machine learning originally, right? I always think it's really interesting how people enter the machine learning space. I'd just love to hear your story, a little bit, of how you ended up running this super cool machine learning company. It seems you're very technically deep, also. And so how you managed to get that depth mid-career. Cris: Totally. Long story short, I'm originally from Chile. I studied econ in Chile and I was working on something completely unrelated. But it was 2016 or 2017, I think, and I just randomly fell into a rabbit hole of ML- and AI-generated art. It was very early days of Deep Dream and ConvNets and AlexNet, and people were trying to make sense of how to use this new stuff in the context of art making. There were some people like Mike Tyka, and Mario Klingemann, and Gene Kogan who were posting these very mind-blowing demos. Things that now feel like stuff you can run on your iPhone in real time. But around that time it was someone...I remember Kyle McDonald — who is an artist — walking around with his laptop, just showing people a livestream of a camera. He had basically...I think an ImageNet model running in real time, just describing what it saw. And it just blew my mind. Again, it's 2016. Now it's pretty obvious, but around that time it was pretty special. I just went into a rabbit hole of that for too long. It was too much, I was just fascinated by it. I actually decided to quit my job, I decided to leave everything I had. I got a scholarship to study at NYU and just spent two years just really going very deep into this. Specifically in the context of, I would say, creativity. My area of interest was the idea of computational creativity. How do you use technology? How do you use deep learning or ML for really creative tool-making and art-making? That two-year-long research process and exploration ended up with Runway. Runway was my thesis at school. It was a very different version of what you see now. But the main idea was very much pretty much the same.
It's like, ""Hey, ML and AI are basically a new compute platform. They offer new ways of either manipulating or creating content. And so there needs to be some sort of new tool-making suite that leverages all of this, and allows people to tap into those kinds of systems in a very accessible and easy way."" The first version of Runway was a layer of abstraction on top of Docker, where you could run different algorithms and different models in real time on this Electron app. You could click and run models in real time and connect those models via either sockets, or UDP, or a web server to Unity or Photoshop. We started building all these plugins where you can do the stuff that you are able to see now on Twitter. Like, ""Here, I built a Photoshop or Figma plugin that does image generation."" We were building all that stuff running Docker models in your computer locally, and you can stream those. It was 2018, 2019. Lukas: Interesting. It must have been a much more technical audience at the time then, right? If you have to run Docker on your local machine. That's not something everyone can do, right? Cris: Totally, totally. I think that that also tells a lot about how much progress the field has made, and how mainstream and how more accessible things have become. Trying to put this set of new platforms and compute ideas for creators, and video makers, and filmmakers required you to know how to install CUDA and manage cuDNN. I don't know if it's just too much. But people were still wanting to do it. There were some folks who were like, ""Hey, this is really unique. I want to understand how to use this."" But then we realized it wasn't enough. You need to go [to] higher layers of abstraction on top of that to really enable creative folks to play with this, without having to spend months trying to set up their GPU machines. Runway has really evolved, and we have a really experiment-driven thesis and way of working on the product. But it's all about trying ideas and testing them out with people really fast. We're building something that hasn't been done before. And so it's really easy to get sidetracked into things that you think are going to work, or ideas that you think are going to be impactful. But since you're working with new stuff all the time, being close to your user base for us has been kind of really, really important. Every time we iterate on the product, I think one consistent line of evolution has been this idea of simplifying...making higher abstraction layers on top of it. The first versions of rotoscoping or inpainting required you to select the underlying model architecture, and understanding what a mask was, and [how] propagation works. If you're really a filmmaker, you don't care about any of the stuff. You just want to kick once, and you want to get a really good result. For us, it's ""How do you build from there, using what we're building behind the scenes?"" Lukas: Were you surprised how well these approaches have worked to generate images? It sounds you started your work in 2017, 2018. The space has changed so much. Do you feel you saw it coming, or have things unfolded differently than you thought? Cris: I mean, things have definitely accelerated. But I think our thesis — when we started Runway three and a half years ago — was pretty much the same. It was, we're entering literally a new paradigm of computation and content. We're not going to be...we're soon going to be able to generate every single piece of content and multimedia content that we see online. 
I've been demo-ing generative models for creative use cases for the last three years. What I was showing three years ago, people were like...it was like, ""Hey, this is how it works. This is how you train a model. This is what the outcome of the model is."" Of course, at that time, it was a blurry 100x100 pixel image. Some sort of representation of what you were describing. Most people took it as a joke, like, ""Oh yeah, cool. Very cool. Cool thing."" Or as a toy, like, ""That's a fun thing, right? You kind of use it once. But of course, I will never use this in production."" I remember speaking with this huge...one of the biggest ad agencies in the world, and I was presenting to their executives. Here's the future of content, type anything you want. And something blurry came out and they're like, ""Cool, not for now."" And they reached out three weeks ago being like, ""Hey, how many licenses can we get for this, tomorrow?"" Because the models are getting just so much better, that it's obvious. It's transforming their industries and a lot of other things. I think what has changed for us is pretty much the speed. Now we're entering a really nice moment where things are converging, and there's a good understanding of what's going to be possible, and where things are going. Scaling laws are getting to a good point. And so we're continuing the same, but the thesis of the company was always built on the idea that this would happen, and it's happening sooner rather than later. Lukas: Do you have a perspective on if this acceleration will continue, or if we just are seeing a breakthrough, and then we're going to need new breakthroughs to get to the next level of quality? Cris: Sure. I think there's definitely more compute that needs to be added to this, more data sets. I think we're still scratching the surface of what it will become. There's still this...I was discussing this with a friend the other day, this idea of a curiosity phase where people are entering the realm of what's possible and coming up with all these solutions and ideas, but there's still a difference between those concepts, and explorations, and ideas and meaningful products that are long-term built upon those. What I'm interested in seeing is how many of those ideas will actually convert over time into meaningful products. I think that conversion into products is not just pure research or pure new models; there needs to be a layer of infrastructure to support those things. It's great that you can run one single model to do one single thing at X percent. But if you're trying to do that at scale, on a real-time basis, for 10 people who then use it on a team and depend on it for their work, then that's a slightly different thing. But I think we're about to see way more stuff around video, specifically. I think image might be solved in a couple more months and video is starting to now catch up with that. It's a really exciting time for that. Lukas: What does something being solved mean to you? Like, you could just get any image that you would ever want or imagine? Cris: Yeah, that's a good one. That's a good question. I would say that I would consider being solved [as] being able to translate something like words or a description into a meaningful image or content that pretty much matches where you're trying to...what you're imagining. And if it doesn't, you're able to control it really quickly and easily to get to the point where you can arrive at your final idea. That's why the combination of models really makes sense.
It's going to be hard to have a full model that does exactly what you want. For instance, for image generation. I think it's a combination of, you have a model that does the first step, which is you generate something. There are no pixels, you generate the pixels. The second step is, you're able to quickly modify it, or inpaint it, or grade it in some way, and start it in some other way. But that whole thing just happens in a few seconds or a few minutes, right? If you speak with anyone in the industry, VFX, or ad agencies or content creation, post-production companies, this is stuff these guys do all the time. This is what they do for a living, right? They're able to create content out of nothing. The thing is, it's just really expensive. It's really, really expensive. And it involves a lot of time and rendering and skilled people to get to that point. I think for me, ""solved"" is, anyone can have access to that professional-grade, VFX-level type of content from their computers and from a browser. Lukas: Do you ever think about making a version of Photoshop, instead of video editing software? If you think images are closer to being solved. Certainly I can't go into Photoshop and get exactly the image I want. I love to play with all the image generation tools out there. But I do think they're amazing at first, but then you kind of hit this point where if you really want the image to look like you want, it gets kind of frustrating. It seems there's also room for an image version of what you're doing. Is that something you'd consider doing? Or, why not make that? Cris: Totally. Yeah. The answer is absolutely. I think, a few things. One, I think we're converging more to this idea of multi-modal systems where you can transfer between images, and videos, and audio. I think the idea that we've been...we built software to deal with each medium independently. There's audio editing software, and video editing software, and image editing software, and text-based...you have models that can quickly translate between all of those. Content — let's say video — it's a combination of different things. You have images, you have videos, you have audio, you have voice. All of those things are now possible. I think for us, when I think about the product philosophy of Runway, it's less about, ""How do you build a better Photoshop or a better Premiere?"" Fundamentally, these models are just allowing you to do the things that none of those others can do. If you think about marginal integrations of those things...yeah, you build a better Photoshop that has a better paintbrush, or a better content-aware tool. But ultimately, when you combine them in new ways, you create a new thing. It's completely new. It's not Photoshop, it's just a new way of making videos, and editing images, and editing audio. All in one, single component or tool. For me, what's really interesting is the multi-modal aspect of things, and translating also into those. And 3D, for instance, it's one of the filters...you're going to start to see a lot of translation from images and videos into 3D. Lukas: Totally. So, I have to ask you your thoughts on deep fakes and things like that. I'm sure everyone asks you that, but I'm really curious what you think about that. Do you think that you would want to put limitations into your software to not allow certain things? Do you think this is about to change the way we view videos, as this technology gets more standardized and available to everyone? Cris: For sure.
As [with] every major technology breakthrough, there are always social concerns about how it might be misused or used in ways that weren't intended. It's a good exercise to look at history to see what has happened before. There's this really good YouTube video about Photoshop when it was first released, I think around the early 90s. They were like...it's kind of a late night show, and they're discussing the ethical implications of manipulating images in magazines. And they're like, should we allow people to manipulate images and put them in magazines? Half of the panel was like, ""No, we shouldn't."" It breaks the essence of what photography is, right? 20 years after that, it makes no sense to think about not doing something like that, right? There's always an adaptation process, I would say, where people need to...we need to collectively ask, ""Hey, how is it going to be used?"" But I think ultimately, you understand what the limitations are, and you also fine-tune your eyes and your understanding of the world to make sense of that thing. Now everyone knows that ""Photoshop"" is a verb that you can use to describe something that's manipulated. You do that same exercise, and you go back in time, and you see the same. When film just started to appear, there was this story, interesting story about...one of the first films that was made is a train arriving at a station. They were projecting that in a room. When people saw the train coming into the station, everyone ran away because they thought a train was literally coming at them. But then you make sense of it, and you're like, ""Yeah, this is not true. I understand that this is just a representation of something."" Ultimately, I think with AI and with generated content, we'll enter a similar phase, where it's going to become commonplace and something people are familiar with. Of course, there are going to be misuses and bad uses. Of course, people can use Photoshop in all sorts of evil ways. But 99% of people are just like, their lives have been changed forever in a positive way because of this. Lukas: Interesting. Well, look, I'd love to hear more about your tech stack. This is a show for ML nerds of all types. I think you're doing pretty hardcore ML at scale. What have been the challenges of making this work, making the interface as responsive as it was? What were the key things to scale up your models? Cris: Sure. There's a lot of things that we had to kind of come up [with] creatively, to make this work in real time. On the one hand — on the ML side — we mostly use PyTorch for all of our models. We have a cluster — basically, an AWS cluster — that scales based on compute and demand, where we're running all those models for training. We sometimes use Lightning and, of course, Weights & Biases to follow up and understand better what's working in our model training. For serving, we optimize for different GPU levels or compute platforms, depending on availability. We've made some systems to scale up depending on demand. On the frontend side of things, everything's TypeScript- and React-based. There's some WebGL acceleration stuff we're doing to make things really smooth. And then there's the inference pipeline, where we're writing everything in C++ to make it super, super efficient and fast, specifically since you're decoding and encoding videos in real time. We also built this streaming system that passes frames or video frames through different models to do the things that I just showed you.
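To make the frame-streaming pattern described here concrete, below is a minimal Python sketch of the general idea of chaining models over incoming video frames. It is illustrative only: `frames`, `depth_model`, and `segmentation_model` are hypothetical placeholders, and Runway's actual pipeline is the C++ system Cris describes, not this code.

```python
import torch

@torch.no_grad()
def process_stream(frames, depth_model, segmentation_model):
    """Pass each incoming frame through a chain of models (sketch only).

    Assumes `frames` is an iterable of (3, H, W) float tensors and that the
    two models are callables returning a depth map and a soft mask.
    """
    for frame in frames:
        x = frame.unsqueeze(0)               # add a batch dimension
        depth = depth_model(x)               # per-frame depth estimate
        mask = segmentation_model(x, depth)  # segmentation guided by the depth layer
        yield (x * mask).squeeze(0)          # e.g. keep only the masked subject
```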
And so we also had to come up creatively with that. That's kind of a big picture of our tech stack. Lukas: One challenge that I'm seeing some of our customers run into — as these models kind of get bigger and more important — is that the actual serving cost of the application increases. Is that an issue for you? Do you do things like quantization? Is lowering your inference costs an important project for you all? Cris: For sure. Yeah, for sure. I mean, we're running...our biggest cost right now is AWS, GPU costs, and inference costs, and serving these models. There are two main areas for sure. We have an HPC, we're doing large-scale training of language models and video models. That takes a lot of resources and time. But just serving on...I would say the tradeoff between precision and speed really matters. Quantizing models is great. But also you need to make sure that you're not affecting the quality of the model because if you're affecting something on a pixel level, it might change the result from being okay to bad. And that might mean user churning. And so, if you're going to spend a few more seconds rendering, that might actually be better. There's always a tradeoff of how much. But yeah, we always try to figure out what's the right balance there. We're still exploring some stuff on the browser. I think the browser is becoming really powerful. The only constraint about the browser is just memory and RAM. And you get...it's a sandbox, so you can't really do a lot of things specifically with video. But you can run some stuff on the browser. And so we would send some things specifically, and convert some things, and make them smooth enough. But I think we're not 100% there yet. Lukas: But you're also training your own large language models and large image models. That sounds like training would be a major cost for you as well. Cris: Yeah, for sure. Retraining some stuff to make sure it works in the domain of what we have is one of our core competences. Now we're training...starting a huge job on our HPC. That's going to take a big percentage of our costs for the next few months. Lukas: Wow. I have to ask. That language interface that you showed me was so compelling and cool. But I have been seeing language interfaces for the past 20 years, and the challenge with these language interfaces is when they don't work, they're just enraging. Actually, you sort of addressed that. Showing how it creates these things, and you can undo them, and you can kind of modify them. Do you feel that that kind of conversational interface is at the point where, for you, it's an interface that you really want to use? Cris: I like to think [of] it as a tool. It's not the sole answer to everything you need. This is not going to be a replacement for all of the workflows in making content, video, images, or sound, or whatever it is. It's just a speed up in the way you can do those kind of things. I think the sweet spot is a combination of both. Being able to have that constant feedback loop with the system, where you're stating something out [and] the system is reacting in some way that matches your idea. And then you have that level of control so you're going the direction you want and doing what you want. Or, if it's not working, you just do it yourself, right? I think a big mistake of research — specifically in the area of computer creativity — is this idea that you're going to automate it entirely. You see one-click off solutions to do X, Y, or Z. 
I think that's missing the bigger picture of how most creative workflows actually work. That probably means that you've never actually worked with an agency where the client was asking you to change things every single hour, or make it bigger, make it smaller, right? Lukas: Right. Cris: It's hard for me to imagine a world where you have a one-click off solution for everything. That feels boring, to be honest. You want to have that control. I think language interfaces are a huge step towards accelerating the speed at which you can execute. Are they the final answer for everything? I'm not sure, but they do make you move faster on your ideas. Lukas: Did I understand you right that you want to build your own large language model? I would assume you would take one of the many off-the-shelf language models today. Are you actually training your own? Cris: Yeah, I think it's...we are, but it's also the fact that ML...the infra for models and models themselves are becoming commodities. It's great for companies like us, because some stuff we kind of need to build on our own. There's a lot of things in Runway that you won't find anywhere else. But there's a lot of stuff, large language models that you can just use off the shelf. You have all these companies offering similar services. It's a great...as a consumer of those, if we want to use those, it's just a cost situation where whoever offers the best model, we'll use. And to a point, it might make sense to do our own. So yeah, sometimes we don't have to do everything ourselves. You can just buy it off the shelf. But some other times, you just need to do it because it doesn't exist. Lukas: Sorry, large language models you think you might do it yourself, even? Cris: We're doing a combination of both. We're using APIs but also re-training some of our own. Lukas: I see, I see. Have you experimented with all the large models out there? Do you have a favorite of the existing offerings? Cris: I think GPT-3 works. I think, actually, the model is Davinci. It's probably GPT-4 by now. I think OpenAI has been making- Lukas: -right, right. Cris: -that silently behind the scenes, it works really well. That's the one I'd say we're experimenting with the most, and we get the best results. Lukas: Cool. Well, look, we always end with two questions. I want to make sure I get them in. The second-to-last question is, what is a topic that you don't get to work on, that you wish you had more time to work on? Or, what's something that's sort of underrated for you in machine learning right now? I realize it's a funny question to ask an obsessed ML founder. But I'll ask it anyway. Cris: I think, audio generation. I think it's catching up now, but it's not...no one really has been paying a lot of attention. There are some really interesting open source models, from Tacotron to a few things out there. I think that's going to be really, really transformative for a bunch of applications. We're already kind of stepping into some stuff there. But, it's hard to focus as an industry — or as a research community — on a lot of things at the same time. And now that image understanding has kind of been solved away, people are moving to other specific fields. I think one of the ones we're going to start seeing very soon is audio generation. So yeah, excited for that for sure. Lukas: Yeah, I totally agree. Do you have a favorite model out there? We just recently talked to Dance Diffusion, or HarmonAI, who were doing some cool audio generation stuff.
Cris: Yeah, there's one — let me search for it — that just blew my mind. tortoise-tts, I don't know if you've seen that one. Lukas: No. Cris: Yeah. tortoise-tts is, I think, the work of just one single folk, James Betker. It works really well and he's been...someone used it to create the Lex Fridman...generative podcast. I'll share with you the audio. It's a whole podcast series that goes every week, where everything is generated. The script is generated by GPT-3 and the audio is generated by tortoise. And you can hear it's like, it's a podcast. You can't really tell. Yeah, really excited for stuff like that. Lukas: Cool. The final question is for you, what's been the hardest part about getting the actual ML to work in the real world? Going from these ideas of models or research to deployed and working for users. Cris: I think these models — and things like image generation and video generation — require a different mental model of how you can leverage this in creative ways. I think a big mistake has been to try to use existing principles of image or video generation and patch them with this stuff. I think, ultimately, you need to think about it in very different ways. Navigating a latent space is not the same as editing an image, right? What are the metaphors and the abstractions they need to have? We've come up with those before, in the software pipeline that we have right now. You have a brush, and a paint bucket, and a context or world tool, and you're editing stuff. But when you have large language models that are able to translate ideas into content, and you navigate and move across specific space or vector direction in ways you want, you need new metaphors and you need new abstractions. What's been really interesting and challenging is, what are those metaphors? What are those interfaces? How do you make sure the systems you're building are really expressive? I think two things that drive a lot of what we do are control and expressiveness. ""Control"" as in you, as a creator, want to have full control over your making. That's really important. How do you make it, so you also are expressive? You can move in specific ways as you are intending to do. So yeah, that's also really...it's really exciting and passionate for us to invent some of those stuff. Lukas: Well, it’s really impressive what you did. Thanks so much for the interview. Cris: Of course, thanks so much for hosting me. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out.",6898 Jeremy Howard — The Simple but Profound Insight Behind Diffusion,https://www.youtube.com/watch?v=HhGOGuJY1Wk,4377,2023-01-05,"Jeremy: I’ve been telling everybody who will listen that I feel like we’re in the middle of a significant spike in technological capability right now. And so if you’re not doing that, you’re missing out on being at the forefront of something that’s substantially changing what humans are able to do. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Jeremy Howard is the founding researcher at fast.ai, which is a research institute dedicated to making deep learning more accessible. They make an incredible Python repository that people use for lots and lots of deep learning projects. 
And they make an incredible set of classes that many people I know have taken, and which are almost universally loved. He was also the CEO and founder of Enlitic, the president of Kaggle, and has done a whole bunch of diverse, amazing things in his career. It's always super inspiring to talk to Jeremy and this interview is no different. I really hope you enjoy it. Lukas: You are the first person to be on this podcast two times. And I think you are the most popular guest that we've had, based on our YouTube metrics. So it's great to have you. I wanted to start with, actually...the most memorable part of our interview — for me personally — was the amount of time that you set aside every day to work on just learning. Undirected, sort of learning new things, which I really thought was an amazing thing that I always aspire to do more of. I was curious. Lately, what have you been learning? Jeremy: I'm spending all my spare time at the moment on generative modeling, around the Stable Diffusion or diffusion modeling space. Lukas: Hence the new course, I guess. Is that part of the learning process? Jeremy: Yeah. It's a chicken-and-egg thing. It's partly ""the new course is because of the learning"", and partly ""the learning is because of the new course"". I've been telling everybody who will listen that I feel like we're in the middle of a significant spike in technological capability right now. And so if you're not doing that, you're missing out on being at the forefront of something that's substantially changing what humans are able to do. When there's such a technological shift, it creates all kinds of opportunities for startups, and for scientific progress, and also opportunities to screw up society. Which hopefully you can figure out how to avoid, and stuff like that. I'm very keen to do what I can to be on the forefront of that, and to help others who are interested in doing the same thing. Lukas: When you say ""spike"", do you mean diffusion models specifically or do you mean machine learning more broadly? Do you mean like- Jeremy: -I mean diffusion models, specifically. Lukas: Interesting, interesting. Jeremy: Yeah. It's a simple but profound insight. Which is that it's very difficult for a model to generate something creative, and aesthetic, and correct from nothing...or from nothing but a prompt to a question, or whatever. The profound insight is to say, ""Well, given that that's hard, why don't we not ask a model to do that directly? Why don't we train a model to do something a little bit better than nothing? And then make a model that — if we run it multiple times — takes a thing that's a little bit better than nothing, and makes that a little bit better still, and a little bit better still."" If you run the model multiple times, as long as it's capable of improving the previous output each time, then it's just a case of running it lots of times. And that's the insight behind diffusion models. As you'd be well aware, Lukas, it's not a new insight. It's the same basic insight that belongs to this class of models called ""boosted models"". Boosted models are when you train a model to fix a previous model, to find its errors and reduce them. We use lots of boosted models. Gradient boosting machines are particularly popular, but any model can be turned into a boosted model by training it to fix the previous model's errors. But yeah, we haven't really done that in generative models before. And we now have a whole infrastructure for how to do it well.
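To make the boosting analogy concrete, here is a minimal Python sketch of the idea Jeremy references: each new model is fit to the residual errors of the ensemble so far. The toy data, depth, and learning rate below are made up for illustration and are not from the interview.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression target the ensemble should learn.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

ensemble_pred = np.zeros_like(y)
models, lr = [], 0.3
for _ in range(50):
    residual = y - ensemble_pred                      # what the ensemble still gets wrong
    m = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    ensemble_pred += lr * m.predict(X)                # each new model fixes part of the error
    models.append(m)
```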
The interesting thing is that — having started to get deep into the area — I've realized we're not close at all to doing that in an optimal way. The fantastic results you're seeing at the moment are based on what, in a year's time, will be considered extremely primitive approaches. Lukas: Could you say a little more about that? Jeremy: Sure. Broadly speaking, we're looking to create a function that, if we apply it to an input, it returns a better version of that input. For example, if we try to create a picture that represents ""a cute photo of a teddy bear"", then we want a function that takes anything that's not yet ""a really great, cute photo of a teddy bear"" and makes it something a little bit more like ""a cute photo of a teddy bear"" than what it started with. And furthermore, that can take the output of a previous version of running this model and run it again to create something that's even more like ""a cute version of a teddy bear"". It's a little harder than it first sounds, because of this problem of out-of-distribution inputs. The thing is, if the result of running the model once is something that does look a little bit more like a teddy bear, that output needs to be valid as input to running the model again. If it's not something the model's been trained to recognize, it's not going to do a good job. The tricky way that current approaches generally do that, is that they basically do the same thing that we taught in our 2018-2019 course, which is what we call ""crap-ification"". Which is, to take a perfectly good image and make it crappy. In the course, what we did was we added JPEG noise to it, and reduced its resolution, and scrolled[?] text over the top of it. The approach that's used today is actually much more rigorous, but in some ways less flexible. It's to sprinkle Gaussian noise all over it. Basically, add or subtract random numbers from every pixel. The key thing is then that one step of inference — making it slightly more like a cute teddy bear — is basically to ""Do your best to create a cute teddy bear, and then sprinkle a whole bunch of noise back onto the pixels, but a bit less noise than you had before."" That's, by definition, at least going to be pretty close to being in distribution, in the sense that you train a model that learns to take pictures which have varying amounts of noise sprinkled over them and to remove that noise. So you could just add a bit less noise, and then you run the model again, and add a bit of noise back — but a bit less noise — and then run the model again, and add a bit of noise back — but a bit less noise — and so forth. It's really neat. But it's like...a lot of it's done this way because of theoretical convenience, I guess. It's worked really well because we can use that theoretical convenience to figure out what good hyperparameters are, and get a lot of the details working pretty well. But there are totally different ways you can do things. And you can see even in the last week there have been two very significant papers that have dramatically improved the state of the art. Both of which don't run the same model each time during this boosting phase, during this diffusion phase. They have different models for different amounts of noise, or there are some which will have super-resolution stages. You're basically creating something small, then making it bigger, and you have different models for those.
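The ""denoise, then re-add a bit less noise"" loop described above can be written down in a few lines. The following is a hedged Python sketch under assumed names, not a specific paper's sampler: `model(x, t)` is assumed to predict the clean image directly (many real samplers predict the noise instead), and `alphas_cumprod` is assumed to be a 1-D tensor of the usual noise schedule, decreasing in t.

```python
import torch

@torch.no_grad()
def sample(model, shape, alphas_cumprod):
    # Sketch of iterative refinement sampling under the assumptions stated above.
    x = torch.randn(shape)                        # start from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        x0_hat = model(x, t)                      # best guess at the clean image
        if t == 0:
            return x0_hat                         # last step: keep the clean guess
        a_prev = alphas_cumprod[t - 1]            # a less-noisy level than step t
        noise = torch.randn_like(x)
        # Re-noise the guess with a bit *less* noise than before, so the next
        # input stays close to the distribution the model was trained on.
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * noise
```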
Basically, what we're starting to see is that gradual move away from the stuff that's theoretically convenient to stuff that is more flexible, has more fiddly hyperparameters to tune. But then people are spending more time tuning those hyperparameters, creating a more complex mixture of experts or ensembles. I think there's going to be a lot more of that happening. And also, the biggest piece I think will be this whole question of, ""Well, how do we use them with humans in the loop most effectively?"" Because the purpose of these is to create stuff, and currently it's almost an accident that we can ask for a photo of a particular kind of thing, like a cute teddy bear. The models are trained with what's called ""conditioning"", where they're conditioned on these captions. But the captions are known to be wrong, because they come from the alt tags in HTML web pages, and those alt tags are very rarely accurate descriptions of pictures. So the whole thing...and then the way the conditioning is done has really got nothing to do with actually trying to create something that will respond to prompts. The prompts themselves are a bit of an accident, and the conditioning is kind of a bit of an accident. The fact that we can use prompts at all, it's a bit of an accident. As a result, it's a huge art right now to figure out like, ""trending on art station, 8k ultra realistic, portrait of Lukas Biewald looking thoughtful,"" or whatever. There's whole books of, ""Here's lots of prompts we tried, and here's what the outputs look like"". How do you customize that? Because, actually, you're trying to create a story book about Lukas Biewald's progress in creating a new startup, and you want to fit into this particular box here, and you want a picture of a robot in the background there. How do you get the same style, the same character content, the particular composition? It's all about this interaction between human and machine. There's so many things which we're just starting to understand how to do. And so, in the coming years I think it will turn into a powerful tool for computer-assisted human creativity, rather than what it is now, which is more of a, ""Hand something off to the machine and hope that it's useful."" Lukas: Do you think the same approach applies across domains? Or is there something about images — the way it's sort of obvious how to add noise — and maybe the data set that we have? I mean, certainly the way you described diffusion, there's a natural application to that to almost any domain, but- Jeremy: Correct. Lukas: -I guess Gaussian noise on text, it's a little unclear to me what that really means. Maybe it’s like... Jeremy: So, last week a paper showing diffusion for text came out. There's already diffusion models for proteins. There's already diffusion models for audio. The audio ones use — or some of them — use a fairly hacky obvious but neat approach of using diffusion to generate spectrograms — which are images — and then having something like a super resolution model. But it's not doing super resolution, it's doing spectrogram to sound. So yeah, these things are already starting to exist. They haven't had as much resources put into them yet, so they're still not that great. But yeah, that's the thing, Lukas, this is not just images at all. It'll be used in medicine, it'll be used in copywriting. The way we currently do generative text models, again, it's kind of a happy accident. 
When I did ULMFiT, the whole reason I created a language model was for the purpose of fine-tuning it to create a classifier. GPT then took that idea and scaled it up with Transformers. What Alec Radford was trying to do there was not ""generate text"", but try to solve other problems by fine-tuning it. There was this kind of discovery, almost, with GPT-3 that when you take this and you scale it far enough, it actually starts generating reasonable-sounding text. But the text is not necessarily correct. In fact, it's very often wildly incorrect. It'll...intentionally working on text generation approaches which are specifically designed for generating text is something that there's a lot of room to improve. Generally speaking, the way I see it is this. You've got a generative model that's trying to do something difficult and it's pretty good at it, or at least better than nothing. It'll be better at it if you can do it in a way that it runs multiple times during inference, because you're giving it more opportunities to do its thing. I think that means that these multi-step inference models — which may or may not be diffusion models, but kind of boosted generative models — are here to stay. Because no matter how good your generative model is, you can always make it better if you can find a way to run it multiple times. Lukas: I guess that is a good segue to another question I had, which is I think one of the really fun things about deep learning in the early days was it was so tangible. You have this fantastic class, where you can just kind of build these models and see how they work and play with them. I think we both have a very similar learning approach. But, one thing I've personally been struggling with, honestly, with these bigger models is just actually engaging with them in a meaningful way. It's fun to run the various image-generating models, but it feels kind of daunting. I'm not sure I have the money myself to buy the compute to make one that really works. We actually had one person on this podcast who did it for fun — Boris — which is a super fun episode, and I felt really jealous of how much fun he had building it. I'm curious how you turn that problem into something tractable, that you can actually engage with. Jeremy: Yeah. Well, Boris is one of our alumni. He's part of our fastai community, and he showed what is possible for a single, tenacious person to do. Lukas: Although I think Google donated like a hundred thousand dollars of compute to him. So it wasn't totally... Jeremy: Yeah, absolutely. If you can show that you're doing useful work, then there's plenty of compute out there which you can get donated to. But having said that, what he was largely trying to do — at least at the outset — was to replicate what OpenAI had done. I take a very different approach, which is I always assume that the best thing out there right now is far short of what the best thing could be. That in five to ten years time, there'll be something better, and I always look for improving that. So yeah, you should take our new course, Lukas- Lukas: I would love to. Jeremy: -which we're in the middle of, because what I've been working on is exactly what you describe. Which is, how to train and play with a state-of-the-art image-generative model in a notebook on a single GPU. As with all of these things, the trick is to start with an easier but equivalent problem. I'm doing all my work — just about — on the Fashion-MNIST dataset. 
Which, rather than being 512x512-pixel images of literally anything in the world, including artworks, in three channels, Fashion-MNIST is 28x28, single-channel images of 1 of 10 types of clothing. I always tell people — whether you're doing a Kaggle competition, or a project at work, or whatever — the most important two steps are to ""Create a rapid feedback loop where you can iterate and test fast"", and to ""Have a test which is highly correlated with the final thing you're going to be doing."" If you have those two things, you can quickly try lots of ideas, and see if they're probably going to work on the bigger dataset, or the harder problem, or whatever. It turns out Fashion-MNIST basically...I've kind of replicated a bunch of different approaches in the literature on Fashion-MNIST. The relative effectiveness of those different approaches on Fashion-MNIST mirrors basically exactly their relative effectiveness on COCO, or ImageNet, or LAION, or whatever. Lukas: Cool. Jeremy: But I can train a model on a single GPU to a point where I can see relative differences in about two minutes. Lukas: Wow. Jeremy: And that means I can very rapidly try things. I've started building notebooks where I show every single little step. And also, it helps a lot to use notebooks, which almost nobody working in the generative modeling field seems to be doing at the moment. What they do, is they have...the normal approach is to do ImageNet 64-pixel or CIFAR 32-pixel — which is still better than doing 512x512 LAION — but it still takes...ImageNet 64-pixel takes many hours on an 8-GPU machine. You can't do a fast iteration loop. In a notebook, I can run a single iteration of diffusion. I can see what the outputs look like because the pictures are all there in front of me. If you're not using this kind of approach, instead you're switching back and forth between a terminal, and then you need some way of actually viewing the images. And given that you're probably not sitting directly on that 8-GPU box, you're probably SSH-ing into it. So, now you've got to find a way to show those pictures. There are ways, by the way, of showing pictures in the terminal. For example, if you use iTerm2 there's something called imgcat. If you use other terminals, they probably support something called sixel, sixel graphics. But they're not going to be as good an exploration environment for this kind of stuff as a notebook is. I think there are lots of opportunities for people like you and me to play in this field. I mean, I know there are because I've started spending time talking to some of the folks who were the primary researchers responsible for the key components of Stable Diffusion. And I'm already telling them things that they hadn't thought of before, by virtue of weird little experiments I've done with Fashion-MNIST on my single-GPU Jupyter Notebook. Lukas: Yeah, that makes sense. A fast feedback loop is so important. That's very cool. I was curious, broadly, if you have thoughts on Stable Diffusion in general. We're sitting here in November 2022, and I think they've done an amazing job of bringing awareness to generative models. What do you think about Stable Diffusion? Jeremy: It's been great for progress in the field, clearly. Generally speaking, I'm all about democratization and accessibility, as you know. I don't love the fact that before Stable Diffusion was released, a small number of people in the world had access to the full generative models.
And then other people could pay for cut-down versions of them, use them in small quantities. The thing is, accessing these things through a web-based API is extremely limiting. When you've actually got the weights, you can really play with both the engineering and the artistic side of doing things that no one's done before. So yeah, I think that's great. I think it's important. I think — as with any of these things — you release a new, powerful technology out there and a whole bunch of people are going to be using it for, you know, not necessarily the things that you would have chosen to use it for. For example, for Stable Diffusion, it seems like a very large percentage of people who are using it to generate lots and lots of images are doing it to generate anime and specifically nearly entirely...very young women with very few clothes on, anime pictures. I'm sure there are people out there who are taking the clothes off entirely. That happens, I guess, with any technology. I don't necessarily have...I mean, I guess you can't stop that happening. But we certainly need appropriate laws around at least making illegal things...make sure the things that we don't want to be legal, are in fact illegal. But yeah, there are obviously huge benefits. And you're not going to get stuff like protein diffusion models, or pharmaceutical diffusion models...none of those are going to develop if the technologies are in the hands of two or three big organizations. So it's certainly a very valuable step on the whole for society to have this stuff as open as possible. And to be clear, it was all trained at universities. The main one, most of the stuff we're using now for Stable Diffusion was trained in Germany, at German academic institutions, using donated hardware. Lukas: I guess it's interesting though that it was, I think, primarily ethics and AI considerations that made folks like OpenAI restrict access to their models. Or at least that's what they said. Do you think that you would know a priori that that was the wrong thing to do? Would you have pushed against that at the time? Jeremy: I actually wrote a blog post about that back when GPT-3 was just announced, and not released. Nearly universally, the feedback — at least from the AI community — was, ""Oh, this is lame. They're just doing it for profits."" In my blog post, I said, ""Well, not necessarily. There are genuine things to be thinking about here."" Which is not to say that that means that the motivation wasn't at least partially profit-driven. It might well have been. It's certainly convenient that the ethical considerations read in this way entirely align with profit-driven motives as well. But, like I say, it doesn't necessarily mean they're not true. And I'm pretty sure it's for both reasons. If you look at the way OpenAI has behaved since then, they've behaved in a way that is very increasingly apparently profit-driven. So, I'm less generous in my interpretation now than I was then, based on their continuing patterns of behavior. I think also with the benefit of hindsight, it feels a lot more like, in the last couple of years, companies keeping models to themselves, the main impact that ends up being is to create a bigger bifurcation between haves and have-nots in terms of capability. Requiring more researchers to pay for API access to do things, a decreased amount of openness, and in fact even what could be argued as being kind of deceitful behavior. 
For example, we now know that the OpenAI models that you can pay to access are actually not the same as what's been described in their research papers. We've now had dozens of people write research papers comparing various work to the OpenAI models, and now we've learned that actually we're not comparing to what we thought we were comparing at all. You know, thousands of hours of researcher time being wasted and papers being published with what turns out now to actually be totally wrong information. I'm definitely more enthusiastic about the idea of being open than perhaps...more confident about that than I was a couple of years ago. Lukas: Do you have thoughts on the language side of things, like large language models? Do you think that...for example, do you think that prompt engineering is headed to be an important way of doing machine learning? You do see these models doing incredibly well in a wide variety of NLP tasks. Better than models trained specifically on these specific tasks, sometimes. Jeremy: Yeah. I think generative text models have both more opportunities and more threats than generative image models, for sure. Like I say, they're kind of...the fact that they work at all is in some ways a bit of an accident. They're far, far, far from being optimized for purpose at the moment. But they're already amazingly good, particularly if you do this kind of stuff where literally there are now dozens of papers. ""Just look at what kind of prompts happened to work on these models that we kind of accidentally made generative models,"" ""let's think step-by-step"", and whatever else. We're starting to find ways to actually get them to do a little bit more of what we actually want them to do. But so far we're using really, really basic things. You know, all this ""instruction tuning"". So, rather than just feeding it the entire internet, let's actually fine-tune it with some examples of things that are actually correct info, that actually represent outputs that we would want for these inputs, rather than just whatever somebody rando wrote on the internet 25 years ago. My worry is...I'm much more worried about misuse of text models and image models, because it wouldn't be at all hard to create a million Twitter or Facebook or whatever accounts, and program them to work together to impact the world's discourse in very substantial ways over time. And nobody would know. We could have...on Twitter, for example, some fairly small number of accounts — often where nobody actually knows the human who's behind it — can have very substantive effects on what people are talking about, and how people talk about that thing. Imagine a million of those accounts, which were actually bots that had been trained to be more compelling than humans — which already for years, we've had bots which humans rank as more compelling than actual humans — and that they've been trained to work together. You know, ""Take alternate points of view in exactly the right way,"" and this bot gradually gets convinced by that bot, and whatever else. It could cause a very small number of people in the world to programmably decide how they want humanity to think about a topic, and pay to make that happen. Lukas: Although if I remember right, it seemed like all of fast.ai's sort of broad mandate was to basically make a no-code interface into machine learning, so anyone could access it. And it does sort of seem like prompt engineering — to the extent that it works — is like a huge step in that direction. Isn’t it? Jeremy: Right. 
Yeah, that's what I'm saying. That's why I said it's both got more opportunities and more threats. The opportunities are vast. Take, for example, the recent thing that was released last week or so, explainpaper.com. Where our students are already...so, with our course we look at a paper or two each week. Last week I had told the class, as homework to re-implement the diff edit paper. Students were saying like, ""Oh, I didn't understand this paragraph. So I highlighted it in explainpaper.com, and here's a summary it gave, and that's a lot more clear now. And then I tried to understand that bit, so I asked for more information."" This is very, very valuable. I saw somebody on Twitter a couple of days ago saying they don't really use Stack Overflow anymore, because they created this tiny little, simple little script called ""ask"" where they type ""ask"" and then something as a prompt — sorry, in the bash shell repl — and it would feed that off to OpenAI GPT-3, and return the result, and they basically use that instead of searching the internet nowadays. Lukas: Wow. Jeremy: Yeah. People are definitely using this stuff and it's going to get much, much better. Lukas: Do you have a clever way — like with Fashion-MNIST and image generation — to play with large language models on kind of a bite-sized scale? Jeremy: Not yet, no. I'll get to that, maybe, in another part of the course, I guess. It's definitely a great question and something to think about. Lukas: Interesting. Okay, a question that I need to revisit — because this is unexpectedly, I think, one of the reasons that so many people listened to my interview with you last time — you sort of made an interesting comment that you felt like Python wasn't the future of ML. You sort of said maybe Julia is the future of ML, and that really seemed to strike a chord with the internet everywhere. I think it's kind of the most-discussed part of Gradient Dissent of all time. So, I'm just curious. Do you have any more thoughts on that? Do you still believe that Julia is the future? You were sort of on the fence about that. Jeremy: I was on the fence about that last time we spoke and- Lukas: Totally. Jeremy: -I would say I'm a little less bullish than I was then. I feel like the Julia ecosystem and culture, it's so focused on these HPC, huge compute, running things on national lab machines. It's all stuff that's very appealing to engineers. It feels good, but it's such a tiny audience. I don't care about whether I can run something on 5,000 nodes. I just want to run it on my laptop. And it's still not great for running on my laptop, really. And it's not great for creating software that I can send you. I can't...if I created a little CLI tool or whatever, well, it's not great for creating little CLI tools cause it's so slow to start up. And then how the hell am I going to send it to you to try out? It'd be like, ""Okay, Lukas. Well, install the entirety of Julia, and then run the REPL, and then type this to go into package management mode."" And then, ""Okay, now you've got this thing and now you can run it."" It's like, okay, that's not going to happen. Or even just deploying a website, it's a lot of fuss and bother, and uses more resources than it should. It's still got that potential. 
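A minimal sketch of the kind of 'ask' helper just described, for readers who want to try it. The openai Python package (its pre-1.0 interface), the OPENAI_API_KEY environment variable, and the model name are assumptions for illustration, not details from the conversation.

# ask.py: send a question from the command line to a text-completion model.
import os
import sys

import openai  # assumes `pip install openai` (pre-1.0 interface) and an API key

openai.api_key = os.environ['OPENAI_API_KEY']

def ask(question: str) -> str:
    # Plain completion call: the question itself is the whole prompt.
    response = openai.Completion.create(
        model='text-davinci-003',  # assumed model name, purely illustrative
        prompt=question,
        max_tokens=256,
        temperature=0,
    )
    return response.choices[0].text.strip()

if __name__ == '__main__':
    # e.g.  python ask.py how do I undo my last git commit
    print(ask(' '.join(sys.argv[1:])))

Wrapped in a one-line shell alias, something like this behaves roughly like the tool described above: type a question and get an answer back without leaving the terminal.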
But...I guess the other thing that's become more clear, though, in the last couple of years is their grand experiment on type dispatch...it is more challenging to get that all working properly than perhaps I had realized, because it's still not quite working properly. Good on them for trying to make it work properly. It's a vast research project. But there's a lot of weird little edge cases and trying to make that all run smoothly is incredibly challenging. I suspect...something needs to replace Python, but maybe it's something that doesn't exist yet. Partly though...what we're seeing instead...everybody knows we have to replace Python. So, what's been happening instead is we're using Python to create non-Python artifacts. Most obviously JAX. JAX uses Python — or a subset of Python — with a kind of embedded DSL written as a library. Which only lets you create things that can be expressed as XLA programs, and then XLA compiles that to run fast on a TPU. That works pretty well. It's very challenging, though, for research, or hacking, or learning, or whatever, because it's actually not Python that's running at all. So it's extremely difficult to profile — and debug, and so forth — that code. Very hard to run it really nicely in notebooks. In our little team working on diffusion models, we kind of all want to use JAX. But every time we try, it's always...because like everything I write is always wrong the first 14 times. And with Python, you know, I have 14 goes at making it better by finding all the stupid things I did. By running one line at a time, and checking things, and looking at pictures. With JAX, I wouldn't know how to fix my broken code, really. It's difficult. Lukas: But you don't think that that flexibility is fundamentally in conflict with making a language performant? I think we covered this last time. Jeremy: It is for Python. It is for Python, I think. For Python, that flexibility is to be able to actually run it as Python code. If you look at where PyTorch is going now, they've got this TorchDynamo stuff where they're working...they basically can interface with nvFuser, and you can interface with Triton, the OpenAI compiler-ish thing. I'm not exactly sure what you'd call it. Clearly PyTorch is heading the same direction as JAX. Which is, if you want it to run fast, you'll use TorchDynamo, or whatever it ends up being called. That's actually now integrated into the PyTorch tree. That's clearly where we're heading. And again, you end up with...probably you'll be using Triton. So you end up...Triton's amazing. Super cool, super fantastic. But you still end up with this thing that's running compiled code. It's not the same code you wrote, but a version of it. More difficult to hack on. If you look at how this works, there's a whole world of software that's written in languages which were explicitly designed to work this way. They're compiled languages. Languages like C++, and Swift, and Rust. They have something very nice, which is they have flags you can pass the compiler. You can pass the -d flag to run it in the debugger, or you can pass the -o flag to run the optimized version. Basically, you get to choose how close the code that's actually running is to the actual lines of code that you wrote. So that for debugging, you can actually...it'll run slower, but it's actually running the lines of code that you wrote. And I think we want something like that, something that, ""Yeah, it looks like Python. It's pretty compatible with Python. 
You can still run it as Python, but you can also run it in an optimized way."" Maybe something that actually takes better advantage of these kinds of type hints that we can provide. That's my guess. What's going to happen is we'll see Python-esque languages...we'll continue to see these Python-esque languages appear, that may begin to look less and less like pure Python, and are designed to work better and better with these backend linear algebra accelerators and compilers. Lukas: Is there some language out there right now that has that feel for you? Jeremy: No, they're all basically these embedded DSLs. Like TVM or like Halide. We have the MLIR project, which is kind of providing the backend needed for these kinds of things. Chris Lattner has a new company, which presumably is going to be placed better than any other to create what we need for this kind of thing. He's the guy behind MLIR. It feels like a big open area to me, at the moment. Lukas: Interesting. Okay, on a totally different topic — that I kind of can't believe we didn't cover last time, I feel like we must have been right in the middle of it — I think I, along with many other people in the world, watched you advocate for wearing masks in the early days of COVID. I think you had some of the most high-profile articles on this — like the second-most popular on Preprints — and I was just kind of curious if you could sort of tell that story from your perspective. And maybe what you were seeing that other people were missing, and how you were kind of approaching that problem differently. Jeremy: It's hard for me, Lukas, because I don't understand why — and I still don't understand why — it's not reasonably obvious to everybody. Like, what's everybody else missing and why? Because from my point of view...well, okay, let me go back. So, February 2020 — mid-ish February 2020, late February 2020 — I had a course coming up at the University of San Francisco that I was going to be teaching. I had heard increasing chatter about this Chinese virus thing. What then happened was it hit Italy, and there was a lot more information in English about what was happening in Italy than there was about what was happening in China. So it suddenly was much more accessible to see what was going on, particularly because a lot of the Italian doctors were actually on Twitter and stuff, so you could read what was happening. A whole bunch of people were saying like, ""This is a disaster"", ""The president of the Italian medical body just died of COVID,"" and, ""There's not enough hospital beds."" I knew it had kind of just started to get detected in New York. I thought, ""Oh, well, it seems like it might be quite likely to come here. What does that mean for our course?"" Not at all altruistic. Just, like, are we still going to do our course? My wife and I kind of started reading about it to try to figure out what should happen with the course. And as we did, we were...yeah, it was very obvious that it was going to be a global pandemic and it was going to sweep through San Francisco within weeks. And so like within two days, I wrote an email to everybody who had registered for the course, and put out a blog post, and said we're not doing the course live. We're going to do it virtually. This is well before our university — or I think any university — had decided to do that. Which again, I already thought was weird. Like I thought, ""Okay, it's not yet here, but obviously it's going to be. 
So why are people acting as if it's not going to be?"" Rachel and I ended up writing a long blog post. We were kind of like, ""Okay, it's not just our course."" We've got all these friends in San Francisco who are doing things that we're pretty sure they're going to look back on in hindsight and think, ""That's a terrible idea, because I put myself and my community at risk."" So we said...we didn't know much about it, so we just said, ""Look, as data scientists, here's what we can see so far in the data. It does seem to grow exponentially, at least at first. And, you know, this is the impact it's been having in Lombardy. Here's the early impact in New York. Here's how the math of these kinds of things works. Here's not just a prediction, but an almost certainty as to what's going to happen here."" That got a lot of attention. We had no idea how to avoid it ourselves. We were worried that...historically, when there are global pandemics, it can lead to violence. It can lead to societal disharmony, or whatever. We decided to get out of San Francisco for a while. We also...it was clear that there was going to be a lockdown at some point because, I mean, why wouldn't there be? Again, none of our friends seemed to believe any of this was going to happen. It's really...I thought it was weird; it just seemed very obvious. And then yeah, there was a lockdown like a week or two later. We had told our daughter's school, ""Oh, there's probably going to be a lockdown."" They sent back this rather annoyed email about interrupting learning or something. The schools were closed for a year in the end, in San Francisco. Then we were like, ""How do we not get COVID?"" Because we probably don't want to get COVID, because it seems like getting COVID can be bad. We started to hear from people who would like...saying maybe there could be longer-term implications of some of these kinds of SARS viruses. So I started looking into how it spread, and I discovered that there's all these countries around China that had avoided getting hit by COVID. Particularly Hong Kong, that's literally a train line away from Wuhan. And that just seemed amazing, you know. That's when I discovered that Mongolia, Taiwan, and Hong Kong all had this either universal mask policy or universal mask usage, kind of culturally. And I thought, ""Oh, that's weird."" Because I thought masks were this kind of weird thing. For some reason, you go to Chinatown, you see people wearing masks and that's how it is, and that's weird. I didn't take much notice of it. But then I started learning it was this respiratory infection, and it kind of started to make sense. I wrote something in the Washington Post talking about how in the Czech Republic, particularly, the populace had independently decided to wear masks, heavily driven by a popular science YouTuber. Basically, within like three or four days, the whole country had made enough masks for everybody, and their president was talking about how proud he was. Again, their infection was going the opposite direction to other countries. I thought that was interesting. So yeah, I kind of wrote an article about that. I talked to a guy who used to be very high up in the government on the science policy side, and I asked him what's going on with masks. He said like, ""Well, nobody thinks there's very convincing science about it."" He said if you want to convince people to wear masks, then you need to find some better science. 
So I contacted basically the 18 smartest scientific researchers I knew, everybody from Lex Fridman to Zeynep Tufekci — not just scientific researchers; in Zeynep's case, a sociological researcher — and said like, ""Do you want to help me put together the evidence?"" That's where our paper came from. Basically, everybody said yes, they all agreed. Suddenly we had this huge author group, so we kind of set up a Slack channel. None of us had a really strong opinion going in. We had one of the world's best aerosol scientists; he probably had the strongest opinion going in, because this is his job. He was like, ""Well, let me explain aerosols to you."" Then what happened was there was this amazing couple of papers that actually used this laser-scattering light chamber thing to actually literally take videos of respiratory particles suspended in the air. Not suspended, but they just float in the air. It showed that they float in the air for up to an hour. And it showed that when somebody wears a mask, they don't appear. That was the point where I went from ""curious and interested"" to ""100% convinced"". Because it'd be like if somebody said, ""I promise you, Lukas, if you throw this ball at that wall, it won't bounce off. It will go through."" You'd be like, ""Well, Jeremy, I'm not sure. But I'll give it a go."" And you throw the ball at the wall, and it bounces off, and you go like, ""Jeremy, I am very sure you're wrong about your theorem."" And that's how it was with masks. There were people who said masks don't provide respiratory protection from these airborne particles, and then here's a video of them not going through the mask. I was like, ""Okay, that's...I don't need any RCTs. There's a video. There's a picture of it working."" I kind of went all in on just trying to say to people, ""No, there's actually a thing that stops the thing that infects us. So we should wear them."" I found it extraordinarily bizarre that everybody didn't just go, ""Oh, look at that video of it working. Therefore, it works."" It was a super frustrating experience. I don't...there's nothing I enjoy about researching masks and there's nothing I enjoy about political advocacy. The former is boring and the latter is stressful. But when there's something that so obviously can save millions of lives — and also can avoid who knows what long-term harm — it just seems absolutely ethically required to act on that. I spoke with all kinds of world leaders, and politicians, and celebrities, and whatever. In every jurisdiction, it was like a whole new conversation. It was like, ""Talk to people in South Africa; 'Oh, we don't believe in masks.'"" It was like, ""Talk to people in London; 'we don't believe in masks'. Talk to people in Australia; 'we don't believe in masks'. Talk to people in Florida; 'we don't believe in masks.'"" Each one, I discovered this horrible thing. Which is everybody decided they didn't believe in masks until their personal jurisdiction got hit hard by COVID. Until the hospitals started filling up. And then they would get back to me and say like, ""Oh, tell me more about this mask thing, Jeremy."" That was infuriating because of course the answer is, ""Well, if you had put in mask mandates two months ago, then this wouldn't have happened. Now it's too late because masks can reduce R by a bit, but not enough to reverse a full-on pandemic, once it's there."" Honestly, it...I got really burned out by the process. In some ways it was successful, but in the end, the pandemic still happened. 
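A rough way to see the point about timing: with simple exponential spread, the same proportional cut to R matters far more when it comes early than after several generations of growth, and only pushing R below 1 actually reverses the curve. The sketch below uses made-up numbers that are not from the conversation.

# Toy illustration of early vs. late mask adoption under exponential spread.
# All numbers here are illustrative assumptions, not epidemiological estimates.

def cumulative_cases(r_schedule, seed=10):
    # Each transmission generation is the previous one multiplied by that
    # generation's reproduction number R; we sum the generations as we go.
    current, total = seed, seed
    for r in r_schedule:
        current *= r
        total += current
    return round(total)

no_masks    = cumulative_cases([3.0] * 8)              # R stays at 3.0 throughout
masks_early = cumulative_cases([2.4] * 8)              # a 20% cut from the start
masks_late  = cumulative_cases([3.0] * 4 + [2.4] * 4)  # the same cut, four generations later

print(no_masks, masks_early, masks_late)
# The late cut still leaves far more cases than the early one, and because
# 2.4 is still above 1, each generation keeps growing either way.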
And in the end, I'm still flabbergasted, particularly now that high-quality medical masks are widely available. Demand is so low that factories have been shutting down. I've never had COVID. Literally nobody I know who has worn a high-quality mask at all times indoors, none of them have got COVID. And everybody I know who doesn't has had COVID. There's a point at which you kind of say, ""Okay, I've done what I can. You do you."" Lukas: So you continue to wear a mask indoors, at all times? Jeremy: Of course. Yeah. Lukas: What would change...when would you stop wearing a mask indoors? Jeremy: I suspect it's the same as the answer to the question, ""When would I stop drinking clean water?"" I'd rather keep drinking clean water. We decided...I mean, remember, it took decades — even after the John Snow experiment — for big cities to decide to invest in clean water infrastructure. Presumably after some number of years, we will invest in clean air infrastructure. China's already done it. They now have, I believe, HEPA filters in pretty much all their public buildings, and they're putting in UV sterilization in pretty much all their public buildings. Hopefully, at some point, the West will do the same thing and then it'll be like, ""Okay, I'm in an environment with clean air,"" so I don't have to self-clean the air. That'd be one option. Another would be...again, China's ahead of us on this. They have nasal vaccines, which are probably much more effective. If we eventually get those, I think they can actually make a significant dent in transmission. The injected vaccines don't make much of an impact on transmission. So yeah, there are technologies that should allow us to be able to be pretty safe in indoor spaces. Lukas: But you don't wear masks in an outdoor space? Is that the... Jeremy: No, I mean, it's not exactly a hard and fast rule. We went to a birthday party recently, for example, where it was a karaoke thing. It was outdoors, but all the kids were singing, and they were tightly packed, and whatever. So, our family wore a mask because there's a high amount of aerosolizing activities going on with a high density of people. But yeah, broadly speaking, I'm not too concerned about outdoors because the airborne particles disperse much more quickly. Lukas: I see. I guess the interesting thing about that story maybe is that there maybe was a fairly broad scientific consensus, but no one was really ready to advocate for it. Is that a better summary of what was happening? If you got all these scientists together and they actually all agreed with what you were saying... Jeremy: They didn't, unfortunately. What happened was it was highly polarized by areas. The people that actually understood this are the aerosol scientists. And the aerosol science community was basically 100% all on the same page. Like, ""Talking, breathing, these are aerosolizing activities. We have loads of evidence that this is transmitted through aerosols. We have loads of evidence that with the droplet nuclei — the ones that are suspended in the air — masks block those from getting to your lungs."" All those were pretty much understood in that community. But then the challenge is, Lukas, that we haven't had a major respiratory pandemic in the West, really, since the Spanish flu. So, none of our infectious disease community has any background in that. 
I spent a lot of time advocating — including speaking directly to the WHO's infection control groups, the folks who kind of ran the response at the WHO — and they were overwhelmingly people who had a background in infectious diseases that are spread through contact. The kind of stuff that hand washing helps with. So they were just coming from a totally different direction, and had decades of experience in treating different kinds of diseases in a different way. They were doing their best to learn and understand. But for some, that was a very difficult experience. One in particular, John Conly, had a very high financial stake in this fomite transfer, the idea that transmission is not through the air but by contact, because he has financial interests in that being the case. So, very difficult for him to come to terms with the idea that this is a respiratory infection, through respiratory particles, requiring respiratory protection. That was a big challenge, this worldview difference between different scientific groups. The aerosol scientists, there were actually none of them on the WHO's infection protection committee...infection control, whatever it was. I noticed — when I was talking to the WHO — there was a total lack of diversity. Every single one had the same kind of academic background, and the same way of thinking about things, and they all knew each other very well. They were also...being involved in the WHO is a very strong status signal in their career, so everybody wants to be invited to those kinds of things. And so you really want to have all the other people on the committee think you're a good, nice person. It creates this real monoculture. So that was another big part of the problem. It was all...it definitely made me a lot more cynical than I was before it, to see how the WHO works. And even our big paper, how to get it published. It took a year from being written to being published. By the time it was published, it was basically too late. The process of getting it published was much more about politics than about science, you know. It was disappointing for me to discover that systems that I had thought of as being very much focused on rationality and data and correctness and rigor...so much of it turned out to be about politics, and networks, and stuff. I guess I was probably pretty naive before all that happened. Lukas: My sense is that people broadly believe that masks reduce the spread of COVID at this point. I'm not sure that I know exactly to what degree...it sounds like you're saying to a really massive degree. But I think you had a part in that. Or maybe just...I just follow you on Twitter and we were just watching you talk about it. But I don't know. It does seem like it’s the mainstream... Jeremy: Yeah, I mean, I was leading the Masks4All group globally. We were the most substantive group doing that. Absolutely. Lukas: It feels like it was successful, though. I mean, I just...do you not... Jeremy: It was successful-ish. If you're in San Francisco, it'll look more successful than if you're in Australia, for example. In Australia...from time to time, we've had mask mandates and everybody wears them when they're told to. The rest of the time, it's strongly recommended, but nobody does. But in San Francisco, I'm told maybe 30% of kids at schools — or some schools — are wearing them. It's definitely...it's disappearing. 
And also people — a lot of people, maybe most people — I see wearing masks, at least in Australia, are wearing masks that don't work very well, even though the good masks are really easy to get. And a lot of people don't realize that if you get a high-quality N95 respirator, you could wear that as many times as you like, until the straps wear out. A lot of people think, ""Oh, you can only wear it once."" A lot of people think it has to be fit-tested. A lot of people think donning and doffing is some complicated thing. There's all this wrong information out there. And so the number of people actually wearing high-quality masks is...to me, it's surprisingly low. If everybody wore one whenever they were indoors, I think we might...particularly if we also had HEPA filters in indoor spaces, I suspect we would be done with the virus, that it would go away. Because how would a respiratory virus continue to transmit when you break the flow of respiratory particles? Yeah. I mean, even in China. All the pictures I see, everybody's wearing surgical masks. It's, like, weird to me. Lukas: Interesting. Well, look, we're almost out of time and we always end with two questions. But you're a little bit of an unusual guest, I don't know exactly how all these will fit your worldview. We like to...I like to ask people, if you had some extra time to research something completely different, what might it be? I feel like you are just an unending font of this stuff. What are some things that you're interested in that you haven't had time to look into? Jeremy: Well, I'll answer a slightly different question because any time I'm interested in researching something, I just do. Lukas: Fair enough. Jeremy: The most recent thing I spent a lot of time researching is children's education. Our daughter missed the first year of school. Because of COVID, in San Francisco they were closed. That would have been her kind of transitional kindergarten year, as they call it in California. Then we came to Australia, and so she went to school — regular school — for the first year here. She was straight into grade one. She enjoyed it. She was always happy to go, and happy to stay there. But it felt like she had blossomed a lot more during her previous year when she was doing stuff over Zoom, and on apps, and stuff than the year that she was in-person in the classroom, which really surprised me. Instead, she had become much more of a perfectionist and was becoming much less resilient after her year at physical school. That all seemed really weird to me, because I thought that environment would be much more healthy than the previous one. I started investigating it really carefully and studying a lot of academic papers about education. I was stunned to discover that there's pretty broad consensus in parts of the academic community — or some very strong data — that suggests schools are not a particularly great place for most kids to really blossom, or at least entirely focus on school learning. In fact, tutoring...kids who get tutoring are among the very top, highest academic performers regardless of their previous background. It seems like all kids can be really successful given the right tutoring. Our daughter was doing all this stuff with apps, and on Zoom, and stuff during her first year. None of that is limited by the speed at which a teacher thinks a kid should go, but instead the computer is dynamically adjusting difficulty over time. 
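A toy sketch of the kind of dynamic difficulty adjustment just described. None of this comes from any particular app; it only illustrates a selector that steps up after correct answers and backs off after mistakes.

# Purely illustrative difficulty ladder; real learning apps use far richer models.
import random

def run_session(skill_level: float, questions: int = 20) -> float:
    difficulty = 1.0
    for _ in range(questions):
        # Crude stand-in for a learner: they tend to answer correctly
        # when the question's difficulty is at or below their skill level.
        correct = difficulty <= skill_level + random.uniform(-0.5, 0.5)
        if correct:
            difficulty += 0.5                        # move on to harder material
        else:
            difficulty = max(1.0, difficulty - 0.5)  # drop back and consolidate
    return difficulty

random.seed(0)
print('difficulty reached after 20 questions:', run_session(skill_level=4.0))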
So, weirdly enough, our daughter was basically at Grade 4 or Grade 5 of math after a few months of doing these apps. They're so much more effective than normal teaching. We were also trying to figure out, ""Well, how do you avoid her getting really bored and stuff?"" So I did this really deep dive into education and discovered there's all these fascinating, different ways of teaching and learning which are entirely different to what's done at normal schools. Eventually, we decided to take her out of school and instead switch to using these kind of more academically driven approaches in a homeschooling environment. Which also seemed to generally lead to better social outcomes, better mental outcomes — better mental health outcomes — and better learning outcomes. That's kind of been interesting to me, to discover this whole world of research that seems really important, for humanity. How kids should learn. It feels like, again, it's being largely ignored by the institutions that we send our kids to. Lukas: Let me just see if I got the summary of this: basically that tutors are much more effective than schools at actually teaching kids things. Is that what you’re saying? Jeremy: That would be part of it. But there's lots of...that's kind of one starting point. Yes, even kids that would otherwise have been doing pretty badly at school can be in the very top performers. That kind of is an existence proof, that pretty much all kids can be extremely successful. But then there's also this kind of interesting data point for us, which is when we gave our daughter an iPad, and some math and reading apps, and somebody on the other end of a Zoom to supervise them, she had a huge amount of fun and learned dramatically more quickly than I thought was possible. And then when she actually went to school, she basically learned nothing for the whole year and ended up becoming much less resilient. There are specific ways of learning that are not particularly compatible with the normal ways we teach at school. For example, we might have talked before about Anki and repetitive spaced learning. My daughter does Anki every day. Literally everything she learns, she will remember forever if she creates a card for it, or she decides she wants to know it. That's kind of quite difficult to do at a normal school because you'd need all of your grade levels to be doing Anki. So that in Grade 5, you've still got cards from Grade 1 or Grade 2 coming back. But what happens at school is each year...for example in Australia, the Year 7 and Year 8 math curriculums are nearly entirely a refresh of the primary school curriculum, because they kind of assume the kids are going to need to see it again, because they've probably forgotten a lot of it. Things like, ""How would you incorporate spaced repetitive learning?"" Some schools in England have tried to do something like that using something they call ""retrieval practice"". I know there's a school called the Michaela school, which I believe had the highest results academically in the whole country. They do something like this. There's a few...there's a handful of schools here and there which are trying to use these kind of research results. But they're kind of the odd ones out. Lukas: All right. Finally...I don't know if this one really applies to you. 
We usually ask — because my company, and this interview, is all about making machine learning really work in the real world — we usually ask like what's a hard part that you've encountered in taking something from research to actually working for some purpose? That may not exactly apply to you, but you seem very good at sort of interpreting my questions in a useful way. So I pose it in its most abstract form. Jeremy: I mean, I've had lots of projects that I've tried to bring into the real world. Lukas: Of course, that's right. Yeah. Jeremy: It's difficult. I've been doing machine learning projects for over 25 years now, believe it or not. In the early days, it was such a challenge because managers didn't believe in the power of data at all. When I would try to tell them that it could be really valuable, they would always say like, ""Can you point to a role model of a company that's been successful because of their use of data?"" And there were none. That was tough. Lukas: Yeah. Jeremy: Then Google came along, which was great, because then I could point at this one company that was really working hard to use data and they've become very valuable because of it. Nowadays that bit's a lot easier. But actually, unfortunately, my answer is going to be that I've kind of — for a lot of companies — I've given up on even trying. Because I tried to get...particularly when I was at Singularity University, where all of our students were basically execs from giant companies. We were trying to convince them to be more data-focused and some of them really took that on board. And then they would invite me to come and talk to their VP groups and exec groups. I saw lots of big companies try to get more data-driven, try to use machine learning. I didn't see any being successful. The issue seemed to be that their entire management teams were people who...that was not their area of expertise. They were not promoted because they were good at that. They would have very smart, data-driven people down in their kind of business analyst levels, but they would have no idea which ones knew what they were talking about, and have no way to kind of curate what they were being told. All of the promotion systems were based on experience, and credentialing, and things other than analytical capabilities. So, in those kinds of companies, I eventually decided, ""Okay, maybe it's not possible for a legacy company to become a data-driven company."" And so nowadays I focus all of my attention on startups created by founders that are already data-driven and have a good understanding of analysis. What we're seeing is, increasingly, the most valuable companies — or particularly the most valuable companies in America — they're basically all now ""tech startups"". I mean, they're not startups anymore, but they're all companies that are created by engineers and data-driven people. I think for data scientists interested in making an impact, the best thing to do would be to try and make sure you're at a company where that kind of work is appreciated and understood by the executive team. Lukas: Interesting. Well, great to talk to you. That was super fun. Thanks for- Jeremy: You too, Lukas. Lukas: -answering my wide range of questions. It's always so inspiring to talk to you. I really appreciate it. 
If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out. Jeremy: And how is everything going at Weights & Biases? I always hear nothing but good things about it. Everybody loves it. I've got to admit, actually, the other day I was talking to my friend — I think it was Tanishq — about like, ""Oh, what's going on with this learning rate here? I wonder if it's working properly."" And then he's like, ""Well, here's a graph of the learning rate."" I was like, ""Oh, that was quick and great. Where did that come from?"" He's like, ""Weights & Biases, it logs it."" Lukas: Yes! Oh, man. Are we still recording? Put that on the... Jeremy: I probably should have looked at the Weights & Biases team. Here I was with like ""plot.plot(x = ...)"", and he's already got it pasted into the Discord chat. Lukas: All right. Well, that made my day. Thanks. Jeremy: Cheers, mate.",10657 "Jerome Pesenti — Large Language Models, PyTorch, and Meta",https://www.youtube.com/watch?v=zvwkVeSbiRo,3155,2022-12-22,"Jerome: When people overbuzz AI, I ask them, ""What did AI change in your life?"" What did AI change? Really, truly. Don't tell me you set a timer on Alexa or Google. That's not life-changing. What was life-changing that came from AI? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Jerome Pesenti was VP of AI at Meta, which is one of the most exciting places where AI research is happening. Before that he was CEO of BenevolentAI, and before that he was VP of Machine Learning at IBM Watson. So, he's had a long career and has seen a ton of different applications and lots of change in the state of the art in machine learning. This is a super fun conversation, and I hope you enjoy it. Lukas: The first question that's top-of-mind is just, with all the advances in large language models that we keep seeing — I know Meta had Blenderbot — I was kind of wondering if you have a point of view — or Meta had a point of view — on building a large language model differently than a DeepMind or an OpenAI, and how you think about that? Jerome: Oh, wow. You go right deep into the challenge there. I would say the large Transformer models...I think at this point, it's not just a language model, right? The Transformer and large models are starting to really be able to be used in multiple tasks. I think this is a trend that everybody is following: size, multimodality, more data, more self-supervision actually and less classical supervision, rather than trying to do multiple tasks at the same time. I think this is working really well. It's why people call them ""foundational models"". I'm not sure I agree with that term. So, I do think everybody's going in that direction and that's paying off handsomely. Where I would say I'm a little bit more cautious is, I think these models have lots of problems. And solving these problems is not trivial, not easy. I would say there are two huge classes of problems I've seen, and so the people who will be able to solve that really will be onto something that's interesting. One is control. When you have these language models...I don't know how much you've played with Stable Diffusion or GPT-3. 
It's really, really surprising in the things it gives you, but sometimes it really doesn't give you what you want at all. It's not necessarily what you ask. Sometimes it has big artifacts that show that it's not humanly generated. And it's not quite clear how you get rid of all this. There's this whole thing around prompt crafting. I think it's interesting, okay, but I don't think you can...I mean, it's kind of scary to say you're going to do like...that there's going to be a new type of software engineering that's going to be for... Because it's so unreliable, you know. And so that's the first piece, which is, ""How do you make all these models more controllable?"", which is like you have a higher guarantee of what the outcome is going to be. The second is bias. Obviously intelligence is about bias, but if you type something...I mean, the easiest way to do it is on these new image generation models. If you type ""CEO"", guess what you get. If you type ""assistant"", guess what you get. If you type ""fast food worker"", or if you type ""banker"". It's striking. I mean, it works. Like 100% of the time, you get extreme bias. And it means you can't really just use this in production. I think it would be terrible. So, very exciting. I think everybody's seeing the trend there. It's working: scale, multi-modality, multi-task, self-supervision. But, you know, they are not very controllable and they have huge bias issues. Lukas: Do you feel like there are still intrinsic cognitive limitations, like a Gary Marcus might say on Twitter? Where do you sort of stand on the promise of this technique with Transformers? Jerome: I'm definitely...you have the spectrum of Gary Marcus on the left and you have people who are extremely enthusiastic talking about AGI on the right. I'm squarely in the middle. Lukas: Oh no, this is going to be a boring interview. Jerome: Yes, yes. I mean, I can tell you some things that are very, you know, controversial. I think Gary really overdoes it, because the progress is undeniable. I mean, everybody seeing the systems is surprised. I've been in the space for more than 20 years and I look at the stuff and I'm blown away. If you had asked me a year ago, ""Would we have made this progress?"", I wouldn't have guessed it. I thought these tasks were harder. But I think what happened is that the more you get closer to human-level intelligence, the more you realize that the task is much harder. Some people are like, ""Oh my god, we're going to lose our job as developers, as creators."" No way that's going to happen. We're still millions away, because as soon as you make some progress, you realize that...as some people have said, the goalpost actually looks further away, because you realize actually intelligence is a much wider space. It's much more complicated. You realize that the system still makes very, very silly mistakes that humans wouldn't make, but it does things that you didn't think would be possible. I am squarely in the middle, which is I don't think we are anywhere close to human intelligence. I also think that ""AGI"" is a bullshit term. It doesn't mean anything because intelligence is, by definition, never general. And then I don't buy Gary, because you can't deny the progress. You look a bit like a fool if you deny that. But, it's such a much bigger problem than people imagine. As we said at Meta/Facebook, we're 1% done. And I really believe it, we are 1% done. We did go 1% of the way, and that's a huge accomplishment. Lukas: 1% what? 
Jerome: 1% to human intelligence. We've made progress. We've made real progress, right? But it's such...intelligence is so amazing, that you still have a long way to go. Lukas: But don't you feel like the stuff that we're building is starting to help build the next generation of that stuff? I kind of can't believe how well the code generation works. I've been using it in my VSCode. Jerome: That one is also super overstated. Lukas: You think so? Jerome: Absolutely. You are in software, right? I give you a piece of code, okay, and I tell you it's 99% accurate. How good does it give you...the problem is that generating code that's not accurate...I mean, sometimes finding a bug is way harder than writing the code from scratch, right? Lukas: That's fair. Jerome: I think the way to think of Codex and stuff like that is like an auto-complete. It's a very smart auto-complete, the same way when you write your email right now, Gmail does auto-complete. It can complete sentences, and it's quite smart, and it's quite impressive. And if you cherry-pick the results, it looks amazing and it's very surprising what it can do. But, you know, it writes something, and then you have to say, ""Well, is that actually accurate?"" You don't have guarantees, and not having guarantees in code is a huge, huge problem, right? Really bug-free code is worth a million times [of just] code. It's not the size of the code that matters. So, I'm really cautious on this one. I do think it's a useful developer tool. People will use it like they use auto-complete to write email. But it's not going to write...it's not going to put developers out of a job. No way. And especially...it's tricky when you write code, because you need to have guarantees. Lukas: Well, I certainly feel like it helps me write code faster. I imagine better versions of it could...it seems very far from putting someone out of a job, but it seems like it could make you work faster. Jerome: It may make you faster, but is it better or is it worse? You can write worse code faster, I'll give you that. That's for sure. Is it really allowing you to write...I think it will — I also believe it, right? — it will make people faster. But how much will depend on the validity of the code? If you had a system that could guarantee you that the code is accurate, that would be a complete revolution. This is not what it is, right? Again, having guarantees and having control over the outputs is something that's really one of the big challenges of these models. Making sure that what it says is accurate, that's another thing. These language models, they hallucinate. Avoiding that is really, really, really tricky. Lukas: Going back to my earlier question, now we're seeing a whole bunch of different big models coming out that all seem functionally like Transformers. You know, trained on a huge corpus at...basically all text that anyone can find, as far as I can tell, and high volume. Do you feel like the research is sort of converging on this one technique? Or do you feel like DeepMind and Meta have different strategies and points of view there? Jerome: Well, actually, you should have seen Yann's tweet a few days back. It's like, ""Hey, it's weird. Nobody talks about reinforcement learning anymore."" Which is...Yann had said — I don't know if you remember — ""That means we don't really need the cherry anymore."" I don't know if you remember this metaphor of the cake. 
The cherry is the reinforcement learning and supervised learning is the icing, and the body of the cake — the genoise — is unsupervised and is self-supervised. He really, I think, predicted that it would happen. And it is happening. From an information theory perspective, it makes sense. When you do reinforcement learning, you get very little information whether you're right or wrong. It's kind of binary: ""Yes"", ""No"", you are going in the right direction. With supervision, you just use a label. And with self-supervision, it's where you use the whole data, so maximizing the information you get out of the data is definitely the trend. I think that's where we're going. And, you know, you see self-supervision happening in every other field. The flip side also is, Transformers are just working amazingly well, and scale is working amazingly well, and the combination of all these right now is a trend. I don't think we have a secret sauce that would be...or we ""had"", as you know I'm no longer there. Lukas: Right, interesting. Do you feel this concern that very few people will be able to do this training at large-scale? What do, actually, academic institutions do in a world where the most exciting results are coming from very, very high-volume training? Jerome: Yeah, it is concerning. I can tell you that the costs of the system and these models...I mean, just before I left, we put online one of the biggest superclusters out there. It's just extremely expensive. I can't tell you the cost, but it's staggeringly expensive. So yes, it is worrisome and it does work. But, I do believe that we are kind of wasteful in the way we do things today. We are not really optimizing. It was very interesting to see Stable Diffusion come out really quickly after DALL-E. I'm a huge proponent of open sourcing, of open models. I'm actually...Meta had done it with OPT-175B, but it was cool to see Stable Diffusion come out after DALL-E. Not only releasing open source, but also shrink-wrapping it. Now that I'm by myself, actually I've been running it on my own computer or on a Colab. It's pretty cheap and that's kind of cool. I haven't been able to train my own version yet, but at least it's a bit more manageable. But overall, I am a little worried. I'm not seeing how we can avoid this, given how well it works. But we also have efficiency gains we can make. Lukas: We always talk about sort of the practical applications here, and how they're different than research. Can you talk a little bit about at Meta? What were the applications that really mattered to Meta that they were using, and how that kind of differed from the research interests? Jerome: Let me ask you a question because that's something I feel like— Lukas: Please. Jerome: -when people overbuzz AI, I ask them, ""What did AI change in your life?"" Lukas: In my life? Jerome: Yes, in your life. What did AI change? Really, truly. Don't tell me you set a timer on Alexa or Google. That's not life-changing. What was life-changing that came from AI? Lukas: That's interesting. I feel like my life is not that different than someone in the 80s, but by that sense...I actually love listening to music with an agent where I could just request it by saying it. It's delightful, but I wouldn't say it's life-changing. I mean, I assume that all the recommendation systems that I interact with probably guide me...I feel mostly happy about that. I remember when Amazon kind of first came out with a recommendation system, it just felt so great. 
It was like, there's a whole world of books that I want to read that I didn't know about. That might be the most...I don't know. What do you think? You've probably thought about this more than me. Jerome: It's a good point. Actually, it's interesting what you say. I will challenge that the first one...I don't think for many people, ""life-changing"" is that I can ask something for music and it plays it. Lukas: Yeah, ""life-changing"" is way too strong. Yeah, sure. Jerome: But it is true. To answer your question, you guessed right. Which is, at a place like Meta, recommender systems are just hugely impactful. And in two areas. One is advertisements and the other is organic recommendation. Just that...by the time I left, my team was a few thousand people and [it] justified the entirety of the budget by far, you know, multiple [times]. The ROI of investing in this system with larger-scale — especially, you can imagine, in advertisement — is really staggering. If you ask me, that's actually kind of disappointing, if you think about it. The most successful application of AI so far has been advertisements. And I would say maybe the second-most successful has been recommender systems in apps like TikTok, for example. But it's kind of behind-the-scenes. Lukas: Well, wait, wait, wait. Actually, you're a search guy. Don't you think maybe...I should have said ""search""? I feel like web search is incredible. Jerome: No, because web search came up without AI, right? Lukas: That's true. Jerome: The whole history of AI at Google, I would have liked to be a fly on the wall there. Actually, there was a...Sundar got interviewed by Kara Swisher just recently. He was talking about how much reluctance there was at Google to use AI in search. It's a fairly recent story, actually. And today, even some people...I mean, I do think actually AI is very useful in search, but I would put that in the category of ""behind-the-scenes"", you don't really understand what it's doing. But it's also a late story. Whereas in recommender systems and ads, it came much earlier as a fundamental block. Whereas I think Google worked pretty well early on with traditional information retrieval techniques. So, you're right. I mean, if you ask me to answer the question, recommenders are the big thing. The second big thing — which is especially when I was there — was moderation. Moderation at scale can only be done with AI. Moderation at scale is done. I think you can look at the stats as a report that are done every three months, but now we are up to like high 90s, and most of the things...even though there are 30,000 people doing manual moderation — that pair with AI — the amount of data to process is so great that the majority of the first action is done by AI, in the 95% plus, for things like hate speech, or bullying, or a lot of complex problems. Doesn't mean it works perfectly, but it creates enough friction that I think it does make the system overall much better. Lukas: When you scale up to that massive volume, to the massive volume of inference, what changes about how you approach a problem like that? Say, moderation at scale and trying to moderate everything that's coming into to Facebook. Jerome: I don't know if you're asking in terms of the actual application or the support of that application. Support of the application is very, very hard. I mean, the whole MLOps aspect is just...you know, and we could discuss that. It's really, really hard. I don't think in my tenure at Facebook/Meta, we solved it. 
We solved some part of it, especially with PyTorch — I think it was a great success — but after all it's hard. All these systems that evolved quickly at scale: very, very hard. On the other side, from a user perspective, scale is tricky because you can have the impression it works well. All our stats show, ""Hey, we made a lot of progress. If you look at since we introduced AI on hate speech, the amount of hate speech in the platform went down 3x."" Unfortunately, that doesn't mean that's the experience of people, and it doesn't mean it's true for anybody, anywhere in the world. Very, very interesting problem. The experience, for example, is very interesting. It doesn't matter if you match your policies and you remove hate speech; what matters, actually is how people experience your product. And that's a very different story. And the experience of people depends a lot on where they are in the world. The language aspect, the cultural aspects are very, very important there. Lukas: It's interesting that you say...actually, I was kind of curious about both sort of the technical and non-technical challenges, but since you bring up PyTorch, I would not have thought that PyTorch was something that you think of as sort of helping with the operations. I feel like when it came out, it seemed oriented more towards research, but I guess maybe I'm wrong there. Jerome: Oh, yeah. That's a long story. I can tell you a little bit of the story, how it happened. Lukas: Tell me the story, please. Yeah. Jerome: Yeah. So when I joined Facebook at the time — right in 2018 — the company had decided to go on a dual path with PyTorch, Caffe2, and ONNX in the middle. I thought, ""That's just such a hack. That's a non-decision."" I think the decision was made two months before I arrived. It's the one thing...usually when you join a company like this, you do not want to make decisions early. This is one decision where I told the team....actually, I didn't say, ""Hey, we should do PyTorch."" I told the team, ""No way, we're going to do this."" We needed...from experience, I knew that we needed to be on a platform that had community support. So I told the team, ""Okay, you're going to have to pick one framework that we know will have traction in the community."" They were honest, and they knew that that could not be Caffe2 at the time. The community support there really dropped. PyTorch was a rising star, but not production-ready. And really, the only one that had all these aspects was TensorFlow at the time. But the team was convinced that the model of PyTorch was better, and allowed more dynamic graphs. So they came back and said, ""Hey, we think we can make it happen. We can make PyTorch a contender, both on the research front and the production front."" And that's where the company bet. For the past four years after the decision, we've been moving almost everything at Meta from Caffe2 to PyTorch. People love PyTorch. So it's not actually a hard thing to convince people. It's just amazing. It's a better tool to do exploration. But it didn't mean we had all the MLOps around it. And to this day, we still are trying to really figure it out. It's not easy, but it was the right choice. PyTorch definitely, as you surely have seen, it's just a product that people love. And you want to start from that. That gave us a lot of traction that was the right direction. But it still lacks a lot of the infrastructure around it. And there are a lot of reasons for that that we could discuss at the end. 
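To make the eager, define-by-run style concrete, the thing researchers tend to mean when they say PyTorch feels like ordinary Python, here is a small generic illustration (not code from Meta or the PyTorch team): the forward pass below is plain Python, so you can branch on the data, print intermediate tensors, and step through it line by line.

# Eager, define-by-run PyTorch: the graph is whatever Python happens to execute.
import torch
from torch import nn

class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.small = nn.Linear(8, 4)
        self.large = nn.Linear(8, 4)

    def forward(self, x):
        # Ordinary Python control flow, decided per batch at run time.
        if x.abs().mean() > 1.0:
            h = self.large(x)
        else:
            h = self.small(x)
        # Intermediate values are normal tensors you can inspect or print.
        print('hidden mean:', h.mean().item())
        return torch.relu(h)

model = TinyModel()
out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])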
Lukas: Do you have a theory of why it's so loved? Because we watched this firsthand. When we started Weights & Biases, TensorFlow had a clear lead. And we watched PyTorch overtake it just in our own logs. It was a really dramatic shift. It's funny because from my perspective — and I've dabbled with both — they seem pretty feature-comparable to me. I mean, in the early days, obviously, PyTorch had the just-in-time generation of the graph. Do you have a theory about why PyTorch seems like it was so much better loved? Jerome: Yeah, I'll give you another little anecdote. I remember the reason actually I felt strongly about this when I joined Meta is before I joined, in my team I remember we had also this problem...at the time you had Theano, you had other systems. We were a small team — I was in a startup and we were in a small team — and we already had a few frameworks. I said, ""We can't do this. We got to agree on one."" And so I think we agreed on one, I think it was TensorFlow. And six months later, they're like, ""No, no, no, we got to use PyTorch. No way we can..."" And I'm like, ""We made a decision!"" We went to PyTorch, and I'm like, ""Okay, there is something there."" I actually think that the reason is simple. The people who developed PyTorch — Soumith in particular — had a design mindset. If I were...the mantra, it was actually a user-centric design. It's funny because I think the people who did it didn't necessarily know they were demonstrating they knew the terminology [?], but it really definitely had the research in mind and what they wanted to do. And you can feel it. The problem with TensorFlow is that it was retrofitted. So even if now, because of influence it's there — it has been plugged on top — it still feels like it's cobbled together. It's hard to acquire the love, you know. You can lose it; it's hard to gain. So it's really about user-friendliness...researcher-friendliness, actually. I think also the fact that research is driving the narrative in AI today. It's not a stable field, right? That really put PyTorch at the center of that universe. Lukas: What were the important pieces that you had to put around it to make it really work for you in a production environment? Jerome: The challenge with PyTorch...actually, the really complex stuff is that it's almost like an anti-pattern. Let me try to explain that. I think there's this saying that ""Early optimization is the root of all evil."" But the challenge with something like PyTorch is that you need to do early optimization. You don't have a way around it. Why? Because you need to create a system that gives a lot of flexibility to users to do a lot of things, yet is optimized. Because scale matters, efficiency and speed matter. So you have this constant challenge of — especially in the interest of the operator internally — to have things that really follow...like, if you couldn't do Transformers today in PyTorch, but it would be awesome in everything else: forget it. Nobody will use it, right? So you need to...very quickly, when you see where the trend is going, you have to go and put in very good operators, and you need to optimize it. It is constant progress; they are doing this. That's one challenge. The other challenge is we had to give that team...I'm really a big believer in focus, and in this case, it was a constant balance. I said, ""Hey, look, you have two focuses. I cannot make it simpler for you, and you cannot screw it up."" One is you cannot screw the external community. 
You have to create something that people will continue loving. You cannot make it bloated, right? The problem when you start creating enterprise software or production software, it becomes bloated, it becomes difficult to use. You can't do this. At the same time, you have to make it work for us internally. It has to have all the production aspects. It has to be deployable, it has to be production-ready, which most people in the research community don't see, don't understand. We had to have these two objectives. And that's hard. The team suffered through, but I think they did actually quite an amazing job at keeping it, because ultimately Meta is going there. It will be 100% PyTorch in a very soon future. And I think the community still loves and adopts it. Lukas: Was there some experience that you were talking about that made you understand the value of community support? Were you using something at a different company, where it didn't have the community support? You just mentioned that a couple times, that it's so essential to use technology that the community believes in. Jerome: Yeah, because I've seen companies be stuck in a dead end. Actually, you could almost argue — maybe they're going to hate me for this — but PHP and Hack at Facebook is a really tricky one. They kind of own it. Facebook is so big that I guess — Meta is so big — they can own it. But I really think this is not very good. I think you see it dying on the vine and you are adopting a technology that just doesn't progress anymore. I've seen it for many systems. I would say all the big data systems, the containerization systems. You can see there's always one winner and if you make the wrong choice, you're stuck at some point moving off from it. Lukas: Right, right. I thought you were going to maybe mention IBM Watson. I'm kind of curious what that experience was like. Jerome: That is a very different story. I can tell you more about this. I think what...I mean, the good thing for me is that I went there through an acquisition. I had created an AI company and IBM acquired it. It was great for everybody. I was very happy. Actually, I think when IBM created the Watson units, that was a bold move. It was really about saying, ""Hey, we believe there is a commercial potential in AI."" That was 2013. At the time, actually, not many people were talking about AI. The deep learning revolution came around in 2011, '12. People were saying it's coming. Actually, Jeopardy! — the challenge when they did it with Watson — did not use deep learning, which is kind of interesting. It's a bit of a dirty secret. It used very little machine learning. It used traditional NLP and managed to get something very good. They made this big bet on it. I think it was really — obviously — the right bet. It was early and it was good. But there were challenges, right? The challenge is that you had to be patient. I tend to say, ""You need to be impatient for profit and patient for revenue."" And IBM did the opposite. They were impatient for revenue and patient for profit. They did a lot of this very large engagement, promising the moon, that you may spend $10 billion to make $1 billion. That's not a very good business. What I was focused on when I was there was to really try to shrink wrap AI and put it as cloud services. At the time, we came up with this idea of putting AI in the cloud as services to do speech, to do conversation. To this day, I think that's still the majority of what Watson is doing. I think it was very ahead of the game. 
But, the only problem is IBM didn't have much of a cloud. I felt a little bit stuck when I was there because I think it's the right strategy, I think we're getting traction, but I'm building on infrastructure that's not as robust as if you are on Amazon or Microsoft. Lukas: And then you went into drug discovery, didn't you? It's super hot now, I feel like. Is that right? Jerome: Yeah, yeah. I got recruited to be the co-CEO of a company called BenevolentAI. I think it's a fascinating field. I'm a huge believer that it will happen. You can see there's a lot of promising things happening in AI. Even at Meta — in the research team FAIR — we were doing things around understanding the function of proteins, looking at making predictions around free energy on small molecules and catalysis. Very interesting stuff you can do with AI today. Now, that said, it hasn't really completely changed the field. I actually think that drug discovery needs a bit of what I would call a ""Tesla revolution"", which is you need a tech company to take it head on. But it has such a huge amount of domain knowledge that it's a very hard problem. It's similar in some way to what Elon did with Tesla. It takes 15 years to understand what it takes to build a car. And I think drug discovery is even bigger than that. It's even more complicated. But the decision process of these companies — when they approach technology — they're saying ""There's no good model out there, but some models are more useful than others"". Okay, that's what they say out there. The reason is the models are more useful because they just use them to justify the decision they had made before. That's the way drugs are made these days. A lot of decisions made, not a lot of data to support it. A lot of influence, you have a concept called a ""key opinion leader"". That's how decisions are made there. I'm not a big fan of influence authority. That's not, I think, how a business should be run. But that's how it is right now. I'm really looking forward to a big disruption and maybe I'll get involved in this again. Lukas: That would be cool. When we started Weights & Biases, we didn't think that we'd have many pharma customers. And now, we work with most of them. So it does seem like at least the pharma companies believe pretty strongly that there's something there for deep learning to help with drug discovery. Do you have a sense for what the breakthroughs have been that have made things like AlphaFold work well? Jerome: Well, there are different challenges. What I find remarkable is that — and I still don't quite understand it — it does seem that deep learning and especially even the Transformer architecture, for example, are kind of able to understand the grammar of things. Of images, of text, but also of proteins, for example. At Facebook, we had a project — at Meta — where you just feed hundreds of millions of proteins to a language model, and the system from there is able to predict function pretty well. Without having seen anything, with very little supervised data. It's something that I'm just not sure I understand, because it's not like a brain understand molecules, right? That means there's this generic computation that works well in so many areas. And it just still blows my mind. I understand that it can do it for language and for images, because humans can do that. But humans can't understand...can't fold molecules or understand their functions. So, why is it working? Why can you predict...why can you do quantum calculations better with...? 
I don't know. It's really, really interesting. It seems to me like this thing that's generic even more than human intelligence. Lukas: Yeah, it does seem like an opportunity to do something that humans really can't do. Jerome: That's the case, yes. But there are lots...back to your question, there are actually lots...you have the chemistry, you have the biology, you have the clinical trials, you have patient data. There are actually many, many stages. There is the target identification. For BenevolentAI, one of the big things we were doing is trying to mine the literature to come up with new graphs, find new relationships, new targets. It's very, very early in the game. Then you have companies that try to figure out, ""Okay, give it a target. What are the right molecules that can affect that target?"" Can we do some AI-assisted chemistry there? And then there are people who try to understand better the biological aspects, like how docking actually works. And then you have the patient data and you have the imagery of the patient data. How can you understand it? Can you deduct from there? Can you combine that with genetic information? Actually, there's really literally like dozens of places where it can affect. I was talking to a friend of mine who just started a company to think of how to design...I think he called it ""promoters"". So, not the piece that's active, but the thing that like first [?] in an RNA-based [?], but the thing that's going to say how much is going to be...how potent it's going to be. The little code that you don't pay attention to in DNA that usually tells you how much is used and how much the cells can be affected. I had no idea this thing existed, but you need a code for there, and it's a few hundred amino acids there. Using AI for that might be very good. The advice I gave him was like, ""Hey, go use Transformers. I bet you they're going to...train them on DNA. They'll figure out..."" But I don't know about it. Anyway, there are a lot of aspects of the process where it can help. I would say dozens. Lukas: It sounds like something that you're excited about right now and looking into? Jerome: Yes, it is. Yeah. But I really...what excites me is, ""How do you get..."" I'm convinced that you're going to see a lot of what we call ""business processes"" be improved throughout the industry. I think you're going to see...it's slow, by the way. You're going to see companies adopt [AI] for part of the processes, like insurance companies and banking and healthcare. They're going to take little blocks. They're going to work with these B2B companies and they're going to adopt it. What I'm more excited with is, how do you change entirely a field? You have transportation. Obviously a lot of people are trying that, with self-driving cars or other kinds of self-driving. Maybe that's going to come first. You have healthcare and you have drug discovery, paired. I think you have education as well, that could be completely transformed. But I'd love to do something that not just takes the current companies and just incrementally improves them — which I think is what's going to happen naturally — but changes the game. I think in drug discovery, you can change the game. You can change the decision process. You can change...the attrition that you have right now that makes a drug cost $1 billion dollars will be diminished by 10x. Lukas: I totally agree with you on drug discovery and autonomous vehicles. You'd be blind not to see the opportunity there and the success that folks are having. 
But I don't actually know that I've seen a ton of success in education. It seems like a surprising...it seems like education actually has the least amount of technology inserted into it. Jerome: Yeah, I agree with you. It's a field I'm very interested in, I've been looking into it. The way I put it...I actually just wrote a little position document for this pretty recently. The way I put it is that I think education is completely...in the war for attention, education is completely outgunned today. If you are a teenager, do you want to go to a boring lecture or do you want to go on TikTok and see stuff by millions of creators that really is adapted to your interests, and understands what you like, what makes you...a system that gets you versus a system that's static. You know, the same way of educating as 500 years ago. It doesn't mean there's no opportunity there. I think there are. But culturally, it's also a difficult field. I think of it...the way I put it is that, look what's happening on TikTok. Kids go on TikTok...my daughters, they send me stuff like, ""Oh, look at this guy, he teaches me math on TikTok."" I'm like, ""Come on"". That's entertaining. I'm not sure that's the way to do it, but it shows you the potential to make it a lot more engaging. You have to engage the user. You have to make it compelling to them. I think there's techniques and there's AI to do that. I think we understand that pretty well, actually. That, I think, is an opportunity. Lukas: Interesting. Jerome: More to come. Lukas: Excited to learn more about this. As someone who likes to learn...I actually think YouTube has become such an incredible educational resource. Even on deep technical topics. And I think the voting is surprisingly effective too. I would have thought that it would be hard for really good educators to sort of bubble up to the surface on very advanced topics, but it seems like it's a pretty good...I don't know. Jerome: I agree. Lukas: The algorithm, I guess, on YouTube is working well for me. I've been learning more math. Jerome: I agree. And you know, when you look at...I think that's the thing that...I'm not sure it works for younger students, but I think for adult education, I think for high school education, a lot of them start bypassing the traditional way, and go into YouTube. But YouTube is also not an educational platform, right? There are other ways to learn. Personally, I love learning through practice and through exercise. Lukas: Totally. Jerome: I think people have different styles. I have a hard time staying in front of a lecture. I love practice and I love something that...the frustration I have with all the education systems today is that they don't start by constantly evaluating you. What are my gaps, what do I need to practice next? What's the optimal thing that I can do next? A lot of systems today really tell you, ""What is the best next thing I can show you?"" That's how TikTok works. So, what is the thing that's going to make you really, really want to come back on TikTok? I don't think education works like this today. What is the thing that's going to make me more informed and want to stay and continue that course? Lukas: Well, I hope you work on this. I'm... Jerome: We'll see. You think drug discovery is complicated? Oh my god, education is also complicated. That's the problem, you know. Healthcare, education, drug discovery, all these complex fields that are hard to disrupt. Lukas: Right, right, right. 
Some other questions I had, I was wondering...Meta has made this huge bet on augmented reality, as far as I understand. Do you think that machine learning has a role to play there, or has caused some of the interest in AR or VR? It's not a space that I understand super well, but... Jerome: Yeah, and it has a...let me give you a framing for it. The challenge with this new kind of interface...let's assume — which is not a guarantee — that it's going to be a set of glasses that you put on your head. And let's say it's going to be the next platform. Because let's be honest, I think phones are an amazing invention, but they're kind of a frustrating invention. You have a little screen like this. You see yourself always on that little screen. My prediction to you is like in 30 years, people are going to look back and say, ""My god, this is like the Stone Age of interfaces."" So, something is going to change it. The challenge with glasses is that it's not an imperative interface. I'm not typing. In some ways, a phone is a little bit less imperative than a computer or a keyboard. You're clearly telling the computer what you want. When you type, you type the key. There's no ambiguity there. I think the touch screen was a little bit more of an implicit interface. It's not exactly sure what you're saying...it's actually using a little bit of machine learning underneath there to figure out what you're talking about. But it's not groundbreaking machine learning to figure out what exact word. And it's actually using some of these language models when you type on your keyboard. But imagine now you have glasses, right? There's no input. So, what is it? One of the obvious ones is voice, but it's very likely that it's not going to be just voice, for sure. It's not going to be just voice. It's going to be gestures. It's going to be motion. One thing that Meta is working on is a little bracelet; they acquired a company that did this. I think it's very, very interesting. You can maybe type in the air or move your finger silently. There's going to be motion. There's going to be trying to understand your intent. The problem with glasses is you don't have a keyboard. You can't enter information. You can't tell the glasses what you want, but you'll need to have a rich interface that understands you. And so AI has to play a role there. It's a very challenging role. It's creating a contextual interface that understands all the context around and lets you really direct the system you have on your face. Lukas: This is probably a speech interface, I'm guessing. Jerome: Speech, the problem is that...speech is part of it. But our guess is — our guess was — that speech may not play as big a role as you think it will. I mean, when can you really speak to your phone, right? Take Siri. How often do you use it? I never use it. So I don't... Lukas: Yeah, I never use it also. Jerome: Yeah, I never use it either. Because it's awkward, right? I'm in the middle here, I'm going to talk to my phone like this? Actually talking to the glasses, while it's possible — I don't know if you saw, Meta came up with the Ray-Ban. My team actually did the speech for it. It's nice, it works well — but actually, there are not many places where you want to do this. Maybe you want to do more motion. Your gestures, other things, a combination of all these things. Tap, you know. The interface will be a lot more complex, multi-modal than we assume. It's not going to be just speech. Lukas: Interesting. Okay. 
Another totally different question that I had — that I was wondering if you had a thought on — is, one thing that's been really striking is NVIDIA's total stranglehold on the training market. I mean, there's some stuff coming out of Google, but it doesn't seem like it has tons of traction, at least in training. Do you have a sense for why that might be? It's lasted a lot longer than I would have thought. There's lots of startups that compete and people working on chips, but somehow it just doesn't seem to move. Jerome: Oh, I know. I would say I know all about it. Remember what I told you earlier, which is that these things are very expensive, right? And when you have a sole provider, it's very complicated and it's very expensive. Thankfully, now the crypto market went down, so I think it's going to be a little nicer for GPUs. But it did feel at that time like a racket, what we were paying for these GPUs. But, the flip side of that is NVIDIA is very good. And they're very good not just because of the GPUs. I think the GPU — especially when you come from more of a PyTorch exploration mode — it works well. It's very multipurpose. I think it's very flexible. That worked really well for us. But the thing also is, NVIDIA got the software really, really well. They really got it right. They work with us amazingly well. They're very competent people to create that. That's a coup de [?], and it's hard to replace. I'll tell you at Meta how I wanted...and I threw some money at other people to say like, ""Go do it, or we'll do it for you."" You got to be able to compete, you know. But software is hard, and they are very talented and they do a great job. And that's what got them there. They just have the best software...they have great hardware and have the best software stack on top of it. If you're serious, it's still the best in town. Even if you compare to the TPU, the benchmarks are comparable, yet the GPU is way more flexible. So unless you have some workloads — I think it works well for ads for Google — the TPU can be competitive, but for the rest, actually, GPU is still the best game in town and they have a great software stack on top. Lukas: You would think more specialized systems would work in more specialized cases, wouldn't you? It's kind of amazing that the flexible system also seems to function the best for almost all these cases. Jerome: Yeah, but think of this thing, the challenge we had, right? Imagine you try to design a chip, and you design it when the big game in town are CNNs, and LSTMs, and a lot of...in recommendation, it's a lot of sparse networks. And then you wake up three years later and everything has changed in the game. The game has changed, and now it's Transformers and actually dense networks start to be really relevant to do also recommendation. You design your chip and it takes five years to get it out. So by the time you get it out, you know it's already over. Which many people are doing and have done as well. It's very hard. It's the problem I told you, this early optimization. Which is if you don't keep your options open — while still optimizing what you have — you may be in a dead end. Lukas: Interesting. Well, cool. We always end with two questions, but I guess before that I'm just kind of channeling all the students that we always get in comments, wherever we post these. You've had this very enviable career in machine learning and we have so many students that use our software and watch these interviews. Do you have any advice for students coming out? 
What would you work on if you were just sort of entering the field out of undergrad or grad school? How would you think about that? Jerome: Well, I would not...I'm not going to give you specifics, but I'll give you a little story that I got from a guy who used to study ants. He just died recently, E.O. Wilson. He invented a really interesting concept around evolution and he wrote a little book, ""Letters to a Young Scientist"". He says, ""When I was young, I came out and..."" He was in his PhD, and he decided to focus on ants. The amazing thing is, at the time, that sounded like a very crazy idea. Obviously, ants we know as a society are very important now. And he became the world's specialist in it, world-renowned. What I tell people — especially in science — who come out is, ""Don't be afraid of going for something that you own, that's your own thing. Go for it. And be bold about it. And actually, don't assume that everybody has done everything. There's a lot of opportunity for you to own it, and go for it, and be focused on it."" That's what I would advise. I think this is a very wide space. There's a lot of space for everybody. Be bold, be ambitious. Lukas: Fair enough, all right. The last two questions are...one is — and it seems like you're kind of doing this — but if you had extra time to work on something, what would it be? Jerome: It's what I do now, I told you I'd do kite-surfing. Lukas: Yeah, totally. But if you weren't kite-surfing all day long, what would you be looking into? Jerome: Well for me, I'd do two things, because...one is, I was a goddamn manager for the past, like, 10 years. I think the last time I coded was before my company got acquired, and I love coding. So I'm going back to coding, I'm going back to getting my hands dirty, really understanding... As much as my team developed PyTorch, do I really understand it? Do I understand how it works? I'd spend more time doing this, and that's a lot of fun. Lukas: I love it. Jerome: I think Karpathy, just coming out of Tesla, he said the same. My skin is cleaner, I sleep better. Dealing with technical problems rather than people problems is always a big boost. That's what I'm doing: really, really staying up to date. I feel it's really critical to understand. My next stage is, ""Okay, I want to write a Transformer from scratch, what is that? What is in it?"" Lukas: Nice. Jerome: The second one I'm trying to do is really try to evaluate where the big opportunity is. For me, I feel like, ""Okay, I've done the B2B startup, I don't want to do another one like this."" I want to try to see, ""What's going to be the big revolution here? Is that going to be drug discovery? Is it going to be transportation? Is it going to be education?"" I'm going to pick one, I'm going to make a bet, I'm going to go for it. Maybe I'll fail, maybe there's 1% chance I'll succeed. But at this, it'll be worth it to. Lukas: Nice, I love it. Final question is, when you think about taking a model from research to deployed in production and useful, where do you see the major pitfalls? Where are the pitfalls that might be surprising to someone that is just a researcher? Jerome: Oh my god, it's so complicated. It's actually really...it's something I feel like we haven't figured out. I mean, I'll reverse the question, which is, ""What makes DevOps good?"", right? Do you want something that's reliable, that scales, that you can test? Testing in AI, it's hard, actually. How do you test? 
You can test like...tests that are very close to the model, or you have downstream tests. Imagine you change the speech recognition and you have 20 systems with 20 layers on top of that. How do you test the last system and what depends on that? ""Reliable"", well, these systems...we claim they are deterministic, but they are not, actually. A lot of behaviors are really weird, that you cannot actually completely reproduce, right? And then scale. These things keep scaling. Every year at Meta, we were like 10x bigger, and it wreaks havoc on all your assumptions. It's really, really hard. It really breaks...the assumptions you want to have to create this, they're just not there. I don't think we have figured it out. I think it's still a work in progress. Lukas: Awesome, well, thanks so much. This was super fun. I really appreciate your time. Thanks, Jerome. Jerome: Thank you so much, Lukas. Lukas: That was great. Thank you. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So check it out.",8875
"D. Sculley — Technical Debt, Trade-offs, and Kaggle",https://www.youtube.com/watch?v=1aajTQvZJ94,3626,2022-12-01,"D: There's plenty of physics that you can do in the world, as far as I understand, if it doesn't involve having access to a supercollider or things like that. And similarly, I believe that there is, and will continue to be, a lot of machine learning that doesn't rely on having access to collider-scale resources for machine learning. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. I was recently introduced to D. Sculley as the new CEO of Kaggle, which is obviously an amazing site that we all love. But I later learned that he was the author of ""Machine Learning: The High-Interest Credit Card of Technical Debt"", which is a paper that inspired so many people, including myself, to go out and start machine learning tools companies. So I could not be more excited to talk to him today. A note to our listeners: this conversation took place in August 2022. Since then, Kaggle has only continued to grow. All right, well, it's great to talk to you. I think the impetus for talking was you taking over Kaggle, which is a really important website in the machine learning community, and important to a lot of our listeners and users at Weights & Biases. But I realized in researching you, which I should have realized, that you are the author of the machine learning high-interest credit card of technical debt paper, which I think inspired a lot of people, and really resonated with me when it came out. So I thought maybe you could start, for people who haven't read this paper, by kind of summarizing it. And I'm also curious if anything has changed since that paper was written. I'm trying to remember now, this must be like 2016 or 2017... D: 2015, I think. Lukas: 2015, yeah, 2015. 
D: If I remember right, yeah. Lukas: It feels like a million years ago. But yeah, maybe before we get into it: I think a lot of people have read the paper, but for those who haven't, if you could kind of summarize the paper, that would be a great place to start. D: Yeah, sure. So, official hi, thanks for having me, I appreciate being here. So my journey in machine learning, it's been a couple decades at this point. I spent a long time at Google working in production systems, some of Google's most production-critical ML systems, for many years. I led some of Google's ads click-through rate prediction systems for a while, and during that time I gained a really clear appreciation for the importance of machine learning as a critical part of larger important systems, and got to experience firsthand all the different ways that things can go in unexpected directions. These were systems that obviously had been around for a long time. At the time that we're talking about, I guess 2015 or so, the systems had already been in use in production, in one form or fashion, for more than a decade. And so at that time, I feel like my team and I had some insights into how things work in machine learning systems over the long term that not too many other people were in a position to be able to reflect on, just because it was a relatively new field at that point. So I thought it was useful to write some of the things down that we were seeing, and using the metaphor of technical debt, I think, was a useful way to frame some of those things. Because when we think about technical debt from a software engineering perspective, we think about the kinds of costs that you incur when you're moving fast. And you probably know something about moving fast in startup land, and maybe having to make some tough calls between getting something out the door now versus adding in another six layers of integration testing, or whatever the trade-off might be. So there are really good reasons to move fast. It's sometimes unavoidable. But in doing so, we create some costs for ourselves over time that need to be paid down. It's not that we can never take those costs on, but we'd better be honest with ourselves about what those costs are. And at the time, I think it was underappreciated how much technical debt can be incurred through the use of machine learning. It's kind of obvious to see that a machine learning stack is built on code, and so it has all of the technical debt opportunities that normal code has, but then it also has these system-level behaviors that emerge over time that have nothing to do with code-level checks, but do in fact create cost that needs to be paid down. Even the simplest things you can think of: when you're first building a model, oftentimes if you're in a hurry, you rush and you put a whole bunch of features in the model, everything you can think of. Accuracy is 0.9, and you're like, ""Okay, that's pretty good, but I can think of another 20 features,"" and you put all those 20 new features in, and now it's 0.92. And then you're like, ""Well, it's pretty good, but if I put another 20 features in, I get 0.93."" And so we're in this regime of diminishing returns, to some degree. It's not necessarily clear, when we're throwing all these features into a model, what the value of each one is, and it's possible that we're putting a lot of features into a model that aren't particularly informative, or where the information is being usefully conveyed already by some other feature, or things like that. It's sort of a bundled approach, and it's typical of early development of a machine learning pipeline. We've seen accuracy go up, so what could be the problem, right? But as I'm sure you've seen, every time you add a feature into a model, you create a dependency. You now have a dependency on some behavior or observation in the outside world, and this means that you have a vulnerability if that behavior in the outside world changes. And it could change because people in the outside world change. It could change because the upstream producer of that signal changes: maybe they create an upgrade, which sounds to them like a really great thing, but your model has not learned on the upgraded signal, it's learned all the weird errors from the version that used to be around, so you could get some weird behaviors at upgrade time. Maybe they get sick of creating a nice feature and turn it off. That's not going to be a good day in their production system. And so it's really important that when we're thinking about model development, we're thinking about the long-term costs of adding system complexity and model complexity and data complexity at the same time as we're thinking about improving accuracy. 
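One lightweight way to act on this point before a feature ships is to measure the marginal value of each feature group, rather than only the bundled accuracy. The sketch below is illustrative only; the data is synthetic and the group names are made up:

```python
# Minimal sketch: estimate the marginal value of each feature group by
# retraining without it and comparing held-out AUC. Data and names are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=12, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

feature_groups = {'group_a': [0, 1, 2, 3], 'group_b': [4, 5, 6, 7], 'group_c': [8, 9, 10, 11]}
all_cols = list(range(X.shape[1]))

def auc_with(cols):
    # Train on a subset of columns and score on the held-out set.
    model = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te[:, cols])[:, 1])

baseline = auc_with(all_cols)
print(f'all features: AUC = {baseline:.3f}')
for name, cols in feature_groups.items():
    rest = [c for c in all_cols if c not in cols]
    delta = baseline - auc_with(rest)
    # A tiny delta suggests the group adds a dependency without much lift.
    print(f'without {name}: AUC drop = {delta:+.4f}')
```

The point is not the particular model; it is that each group's contribution is written down before its long-term dependency cost is accepted.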
Upstream producer of a signal that you're consuming also has an upstream producer of a signal it's consuming and that that chain might hop several several links um and so it could be that your system is being impacted by some other Upstream signal uh you know several hops up in the fold and if you're not really careful about making sure that uh alerting and um things like that are also being propagated transitively you're not going to know until it's until it's hitting in your production data and so you know these sorts of things can happen um and you you want to be you know defensive as possible right so working on your early warning alertings and all these things to make sure that if something's coming down the bike you get notified in events um you also want to think about you know we talk about coding defensively and regular engineering um you know coding defensively on data often looks like monitoring of your input data distributions checking for things like sudden changes in input data skus or streams uh one thing you could imagine is uh let's say you have a model that is consuming data globally but for whatever reason a a data center in a given part of the world goes down that day like it can happen suddenly your input data is likely to be highly skewed from what it normally looks like because you're missing a giant chunk of data especially if there are say you know large time of day of local time of day effects you could have very uh different behavior for a given day or period of days through an upstream outage that if you don't have the proper input stream alerting about you might not know what to think about do you feel like these problems are getting better or are getting worse and how do you feel like the the change to kind of more complicated bigger more black box models um affects this calculus in 2015 when we first wrote these papers um we got basically two reactions uh one was the sort of you know uh very nice affirming reaction of oh my gosh this stuff is so important thanks for writing this down we wouldn't have thought of any of these things or more often yeah we've we've encountered some of these things but we didn't know that other people did too you know uh those those kinds of reactions um the second major reaction that we got was uh from large parts of the animal research community that was basically what are you people talking about um and uh you know like that that first nurbs paper uh got um you know a uh a full you know poker hand straight over a few scars you know all the way from the highest possible lowest possible a couple win the middle like just no no idea really what to do with it and uh you know eventually they they let us in um uh mostly on the like well you seem to be passionate about what you're talking about you uh people disagree with you maybe so why don't you come and hash it out it was very reasonable statement we were happy to do it um but um I think you know the world here in 2022 understands that these these issues are real that their real work say um aren't just an accident or you know what happens if you hire the wrong email engineer or something like that they're they're systemic and so we need to approach them systemically so now there's this whole field of ml Ops and when you say you know ml Ops people nod sagely and say yes yes we need to invest in mlaps um you know uh it's it's a totally different world from that perspective in that you don't have to convince people um the these problems or problems um uh that that message I think has 
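The input-distribution monitoring described here can start very simply: keep a reference histogram per feature from training time and compare it against what the model sees live. A minimal sketch of that idea, with synthetic data, invented feature names, and an arbitrary alert threshold:

```python
# Minimal sketch of defensive input monitoring: compare today's feature
# distribution against a training-time reference and alert on large shifts.
# Data, feature names, and the threshold are all invented for illustration.
import numpy as np

def psi(reference, live, bins=10, eps=1e-6):
    # Population Stability Index over bins fit to the reference distribution.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    live_frac = np.histogram(live, bins=edges)[0] / len(live) + eps
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = {'latency_ms': rng.normal(100, 15, 50_000),
             'text_length': rng.poisson(40, 50_000).astype(float)}
# Simulate a skewed day, e.g. one region's traffic dropping out upstream.
live = {'latency_ms': rng.normal(100, 15, 5_000),
        'text_length': rng.poisson(25, 5_000).astype(float)}

ALERT_THRESHOLD = 0.2  # arbitrary; tuned per feature in practice
for name in reference:
    score = psi(reference[name], live[name])
    status = 'ALERT' if score > ALERT_THRESHOLD else 'ok'
    print(f'{name}: PSI = {score:.3f} [{status}]')
```

A real pipeline would wire checks like this into the same alerting path as the rest of the serving stack, so a skewed input day is noticed before the model's outputs are.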
Lukas: Do you feel like these problems are getting better or getting worse? And how do you feel like the change to more complicated, bigger, more black-box models affects this calculus? D: In 2015, when we first wrote these papers, we got basically two reactions. One was the very nice, affirming reaction of, ""Oh my gosh, this stuff is so important, thanks for writing this down, we wouldn't have thought of any of these things,"" or, more often, ""Yeah, we've encountered some of these things, but we didn't know that other people did too."" Those kinds of reactions. The second major reaction that we got was from large parts of the ML research community, and it was basically, ""What are you people talking about?"" That first NeurIPS paper got a full poker-hand straight of review scores, all the way from the highest possible to the lowest possible, with a couple in the middle; just no idea, really, what to do with it. Eventually they let us in, mostly on the basis of, ""Well, you seem to be passionate about what you're talking about, and people disagree with you, so why don't you come and hash it out."" Which was a very reasonable statement, and we were happy to do it. But I think the world here in 2022 understands that these issues are real, that they're real work, that they aren't just an accident, or what happens if you hire the wrong ML engineer or something like that. They're systemic, and so we need to approach them systemically. So now there's this whole field of MLOps, and when you say ""MLOps,"" people nod sagely and say, ""Yes, yes, we need to invest in MLOps."" It's a totally different world from that perspective, in that you don't have to convince people these problems are problems. That message, I think, has gotten through, and I'm happy about that. In terms of, when you have much larger models, do these problems get worse? They certainly get more acute. And I'm not going to say that we're in a worse spot, because I think that having a whole field of really smart people working on these problems, and creating infrastructure that can help address them, is a better spot to be in than having people think about these problems for the first time, or rolling their own. But from a reliability standpoint, as our models get larger and larger...why are we making models larger and larger? We're making them larger and larger because we want to learn usefully from more and more data. Why are we throwing more and more data at a problem? If you were thinking of the problem of, say, estimating the probability that a coin comes up heads, you don't necessarily need to go from a billion to 10 billion examples, right? Basic statistics will say that after a couple hundred flips you're going to get a pretty good estimate, and you can stop. But we don't do that with machine learning. We keep going, because we need our models to exhibit ever more fine-grained behaviors and to respond usefully to a wider variety of input environments and scenarios. We have larger and larger datasets because we need more and more behaviors that our models can adapt to and can exhibit. Now, if you were to tell a typical software engineer, ""Hey, the system that we're building used to need to have a thousand behaviors and now it's got a million,"" that person would probably say, ""Well, our testing is probably also going to be a priority here. We used to have maybe 2,000 unit tests, two for each of these behaviors. Now you're telling me we've got a million, so we're going to have to hire a couple more test engineers,"" and maybe many more. When our models are being relied on to produce many, many more behaviors in a useful way, I think this really ups the stakes on our overall processes of vetting and quality assurance and sanity checking and validation of our models. Twenty years ago, machine learning was basically, ""Look, you've got your test set and your training set, and so long as they're from the same distribution, we're just going to assume that your test data has all the behaviors that you're going to need to worry about. So no problem: just make sure you've got good accuracy on your held-out test set."" And that's not a silly place to start, but it's probably not a great place to end. Why do we use IID datasets from the same distribution for test and training? Everybody knows that this is what you quote-unquote should do, but let's remember why we're doing this. We're doing this because there are clever statisticians who for many decades have said important things like ""correlation is not causation."" And the machine learning people are like, ""Well, we're going to just learn through correlations. We're learning from observational data, we've got giant amounts of observational data, so we're just going to learn from that."" And the statisticians are like, ""Well, what are you going to do about the whole correlation-is-not-causation thing?"" And the machine learning people's response is, ""Well, if we guarantee that the test data is from the same distribution, then in terms of outcomes we can ignore this inconvenient fact that correlation is not causation."" And the statisticians are like, ""Well, that's not awesome, but I guess you're right, and so long as you promise that your testing will always be from the same distribution, we can't really argue with that."" Obviously it's a caricature, and I hope not to offend any statisticians or machine learning people with this. But so, we do this IID test/train split not because we think this is how the world works, but because if we don't do that, then we expose ourselves to a whole set of much more difficult problems in terms of the learning settings that we're in. And to some degree, all of the theoretical guarantees of supervised machine learning rely on this assumption that we're going to be staying in this IID test/train split world. And so this is all fine, with the one small problem that the world actually almost never works this way. We can, offline, do our little research idea of saying, ""Okay, well, I've got my dataset, I'm going to split it carefully, and so these are therefore from the same distribution."" But when we go and deploy a model in the real world, it's pretty unlikely that the data that model encounters is going to be from exactly the same distribution that happened to be in our limited historical snapshot of data that we collected previously, because the world tends not to be that kind to us. So our models are going to encounter data from different distributions. They're going to encounter worlds in which correlations that existed spuriously in our training data do not hold, or maybe are explicitly broken in our production environment. And this means that we have to really up our game on evaluation. It means that we can't just rely on test-set accuracy or things like that as our final validation. We need to be much more rigorous about cataloging for ourselves, and talking to our clever domain experts and things like this, to tell us: okay, what are the places where our correlations are going to break down, where might our blind spots be, and how can we create specific stress tests to analyze our performance in those areas? 
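In practice, the stress tests described here tend to look less like one aggregate number and more like a table of metrics per slice, where a spurious correlation can be seen to break. A minimal sketch of that reporting pattern, using entirely synthetic data and an invented 'language' slice column:

```python
# Minimal sketch: report accuracy per slice instead of one aggregate number,
# so a regression on a small or shifted segment is visible. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 8000
df = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
    'language': rng.choice(['en', 'pt', 'sw'], size=n, p=[0.8, 0.15, 0.05]),
})
# The label depends on x1 for 'en' but on x2 elsewhere, so a shortcut learned
# mostly from the dominant slice will not transfer to the smaller ones.
df['y'] = np.where(df['language'] == 'en', df['x1'] > 0, df['x2'] > 0).astype(int)

train, test = train_test_split(df, test_size=0.3, random_state=1)
model = GradientBoostingClassifier().fit(train[['x1', 'x2']], train['y'])
test = test.assign(pred=model.predict(test[['x1', 'x2']]))

print('aggregate accuracy:', round(float((test['pred'] == test['y']).mean()), 3))
for lang, g in test.groupby('language'):
    # The aggregate hides how much worse the low-coverage slices do.
    print(lang, round(float((g['pred'] == g['y']).mean()), 3))
```

The slices themselves (languages, regions, annotation sources, and so on) come from the domain experts D. mentions; the code is only the reporting scaffold.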
decisions about you know version a versus version B and you know where are the improvements and where are the detriments to for a given level uh Improvement or update um so some of these are going to be judgment calls um uh I I think that to do this well um it's it's really helpful to have some standardized practices uh so once digitized practice that I think is underutilized in the field is to have really detailed write-ups um on every single model change you know uh that is being proposed for a new production launch uh yeah almost like a paper or a mini paper just about that one change analyzing it in depth um so that yeah we can have uh some usefully distilled knowledge about what that change is um uh I think that you know machine learning people often play a little bit fast to loose with their experimentation um and uh you know I mean the fact that it's useful to have infrastructure the two supported notebook of experiments and like this is an improvement like it's a really great thing to have but it also says something you know to some degree about the uh the state of the world where where something like this is is seen as a a really useful Innovation which of course it is um but um you know so number one making sure that every single change no matter how small is is carefully um uh analyzed and written down I I really do feel that writing things down is important you know as much as I love having an automated system that that collects all of your past experiments and sort of gives you the numbers I think that that human step of reading through the numbers and you know um drawing a conclusion and and writing that conclusion down in human language so that it can be uh discussed and poked as is a really important step you know to First approximation I think it's science is what happens when you write things down and it's important for us to be scientists um so then you know what's what's standard practice uh everybody brings their write-ups um into a a meeting and um people will talk about them and there have to be you know a couple people who make the call in the end but uh but these things should be discussed they should be debated they should be um you know uh looked at from every lens and uh with you know really carefully with as much data and insight as we can bring in these problems and then and then you know use Lane farmer votes are going to have to make a call but uh but they we should be giving those decision makers as much of context and insight as they possibly can yeah that makes sense I I guess another change big change that's happened since 2015 is many many of the new applications and models operate on unstructured data and I think there's sort of an implicit assumption even in talking about features that were operating on tabular data which I think was the vast majority of use cases in in 2015. 
do you think there's anything that kind of changes about what you're talking about when um the inputs are you know images or movies or audio files where you probably can't worry about the distribution of like the third you know pixel in every image like it's hard to say what that means even no so it's a great Point um I think that the basic ideas still hold and I'm enough of a dinosaur that I I say features um uh you know sort of as my go-to but I think that the same ideas hold directly even in unstructured data like images like video like audio like you know relatively unstructured text um you know I think the uh uh the first line paper had this really nice example of Huskies on snow backgrounds versus non-snow backgrounds um and I don't think that we have to have extracted a feature you know is snowy background um to to see the point here right um the questions are you know what are the qualities of the data what's the information that's being contained in the data we can often talk about that using the language of features um but it's it's I think it holds generally for any sort of correlation um that's going to exist in our input data and so you know that could be the moral equivalent of smelly backgrounds or um uh you know backgrounds in an image or facial characteristics uh in certain populations or uh any number of of uh characteristics that can come through on video uh or image you know um there's there's some pretty interesting um uh stories of um you know cancer detection uh on images that might have had uh Sharpie circles written around uh some of the images when they were annotated by the original doctors or things like that you know like do those corresponds to literally literal features no but they're that they're certainly uh qualities of the data we need to be aware of in the same way that uh for audio input data um you know speaker characteristics uh and being you know inclusive of a wide range of speaker categories is really really important so I guess I I do want to talk something something about kaggle because that's that's your new Java I'm I'm curious how it's going but I'm also um curious to know what got you excited about about joining kygo in the first place like it's kind of an interesting choice because you know so many I mean I love kaggle I think it's it's played a bigger role in the ml field that people even maybe realized like it was the first place I think a lot of people saw deep learning and it really working for example um but the the criticism kaggle and I think there's some truth to it has always been that you know kind of making a high performing model on a specific data set is sort of the least of the problems of getting you know machine learning to to work in the world and I feel like you're like this real expert on getting you know machine learning models to work in the real world um so how does that connect with you um joining kaggle yeah so um great set of questions so first of all I'm really excited about being part of chemical I um uh have had touch points with kaggle at a couple different points um I I ran uh you know one of the uh uh early competitions and then we we ran another uh competition called inclusive images a couple years ago as well so I've known the team for a long time and I've been a a big fan of the platform um I don't know if you've ever seen any of the papers that I've written around uh you know the sort of state of the machine learning field in general but I I feel that we are at a bit of a tricky spot in um the life cycle of 
the field of machine learning research we're at a place where there are incredibly strong incentives for people to be publishing papers um I don't think I need to oversell that now but it's it's true that that you know publishing papers is a big deal um you know when you add it all up there's something like 10 000 papers a year you know give or take publish to top conferences each year um but there's a sort of interesting thing uh each of those papers is claiming uh you know 0.5 percent or one percent Improvement on some important problem but happily really improved the field by five thousand or ten thousand percent per year like I don't think so uh so something interesting is happening there if you've been involved with um conferences either as a submitter or a viewer or an area chair um you'll notice that uh our reviewer pools are getting freezy tapped out and they have been for some time you know in today's conference reviewing world it is often the case that uh uh reviewers may be first-year graduate students um which is you know like obviously wonderful that they're performing the service but it's quite a different thing to be getting um a you know high stakes review on the quality of piece of of you know research from someone just entering the field versus somebody's been in the field for many years and and this is just a function of the growth of the field the growth the field has been uh you know pretty astronomical you know uh the number of papers uh you know sort of appearing per year I believe is growing exponentially it certainly was the last time I checked um and the number of qualified reviewers is not growing exponentially so this is interesting um as a field it's easy to see that we're sort of fragmenting um drastically across you know many many benchmarks as a field we're really pushing this idea of novelty it's it's quite difficult to get a paper published without a a novel algorithm um and you know in terms of science uh I think that this is leading to a world where we don't necessarily have the best understanding of um the algorithms that we think are the are the best or they go to because we're so busy inventing new ones um and just as a comparison point I I I no one would confuse me with a physician uh but my understanding is that in the medical world doctors uh often publish papers that are you know case studies about um uh you know diseases or treatments or stuff like this uh I would certainly hope that there is not a strong impetus that every single paper that is published in the medical field has a new treatment you know like if novelty is like the number one thing in every single you know uh medical thing has to be testing something new I'd be worried as someone who likes to go to the doctor to get healthy now in the medical field we often see meta-analyzes we often see replication results we often see case studies that sort of you know say reporting the experience of a a given trial um or a given treatment or things like this and those kinds of papers are largely missing from the field of machine learning research right now and I think it's a problem when I look at kaggle I see a world where we're able to promote much of this kind of missing work when kagglers approach a problem there are often you know thousands of teams competing um to solve a given problem this means that the the level of empirical rigor is you know to my mind simply unmatched by any other process um uh and they're you know compared you know side by side but yeah so we get this nice 
leaderboard effect and things like this but they the community is also like folks are committed to doing their best but they're also committed to sharing and to communicating their ideas and so you know through uh the notebooks platforms and other things like this that we have in the discussion forums uh there is a tremendous amount of knowledge um being shared captured disseminated that is it's just this incredible resource for the field and it's the kind of knowledge that isn't about novelty it's about Effectiveness and it's about rigorous understanding and so to me that's that's deeply compelling and something that I'm really excited to be a part of now I believe that we can do more to to help distill and share the knowledge that the the community is is generating um but it's it's there in you know implicitly in all of the discussion posts and all of the notebooks and all of the competition results and things like this um so I I find that really exciting and really about compelling and I asked about ml Ops and things like this you know I obviously that's that is part of my background and you know for me to go and say look we've we need really rigorous in-depth analyzes of all our models and then for me to you know then notice that on kaggle you know almost all of our competitions have like a single number summary metric is the the output like yeah I notice a tension there um a I think that over time we'll be pushing to help create more uh competition environments and other environments that allow people to uh experience more of a production environment to be evaluated more on their ability to to do things that are you know make sense in a production environment uh but we just had a competition close that measured efficiency as as one of the evaluation metrics I think things like that are really important uh we can do a lot more in that area so we're gonna you know push to make sure that the community is continuing to go in the most interesting and most important directions I think that's good for everybody uh but overall I view you know kaggle as one of the great uh uh resources in the ml world right now uh I think it's been significantly underappreciated relative to the contributions it's already made as a as a community but I think that with the little bit of help and guidance we can do even more yeah I mean I feel like kygo also does kind of an amazing thing of giving lots of people access to machine learning like you know it's a super friendly community and there's a lot of learning resources um and I do know a lot of people that kind of got their start in machine learning in kaggle and if they'd had to go you know back to school to get a PhD to engage in machine learning they they wouldn't have done it for sure so I think that's an amazing uh thing I I wonder though it's funny you know it's funny because it you know you just said you just talked about you know kind of papers where they're trying to you know eke out the last like you know 0.1 percent of performance and and that does seem like something that kaggle um you know really celebrates and there's there's part of me that like loves that like I think getting you know the last bit of performance out of a model is actually a pretty fun um experience absolutely right you know I I'm not going to argue against really accurate models right you know um I I think that the thing that's most interesting though is you know a finding out what the header is is really important for any given problem and you know from a machine learning 
perspective, we're often saying things like, well, the model is the most important thing. But all of these competitions are in application areas where there are people who really care about solving their problem, whether that's helping to save the Great Barrier Reef, or identifying whales, or helping to detect credit card fraud, or anything in between. Those folks really care about solving important problems for the problem's sake, not necessarily from a machine learning standpoint, so making contributions on that side is also really important. But what I find is that when folks are motivated to squeeze every last percent out of a machine learning problem as a challenge, it leads to an incredible diversity of approaches, and that's the thing I find most interesting. It's not necessarily that there was one winning solution at the end and we all celebrate that winner as an awesome person, although they are awesome people and we should celebrate them. It's that we also get a huge amount of information about other things that were tried, that seemed like good ideas but didn't work as well for whatever reason. You can think of this as ablation studies at scale. So it's not just the position at the top of the leaderboard that's interesting information. We do have thousands of teams participating, and we need the competition structure to make sure that folks are properly aligned, but the results that come out of those are, I think, interesting to distill up and down the leaderboard. Lukas: Although it's funny, even without the competition structure, there's a lot more on Kaggle these days than the competitions. D. Sculley: Absolutely. Lukas: And fun, right? I think when Anthony was talking to me on this podcast a while back, he was saying that the datasets were maybe even more popular than the competitions, which I was surprised to learn. D. Sculley: So, Kaggle has become a really interesting set of resources for the world, and competitions are definitely one of them. But you're absolutely right: we have more usage of Kaggle from people looking to access datasets for their own machine learning needs than from people coming to us for competitions. That was something I didn't know before I joined Kaggle, but it's something I've come to appreciate very deeply. We have, I think, 160,000 publicly shared datasets on Kaggle. It's an enormous trove of information. And what's great about datasets on Kaggle is that they're not static things: there are opportunities for the community to post little discussions and notes, and to post example notebooks, so that it's not just about getting a CSV file with a lot of numbers in it. It's about understanding what's in the dataset, where the weaknesses might be, where the strengths might be, and having a really rich amount of annotation that evolves from the community's involvement in these datasets. Now, I think there's even more that we can do, and I'm excited to do that, but the datasets are a fantastic resource. The notebooks are an incredible resource too; there are hundreds and hundreds of thousands of publicly shared notebooks with example code and really carefully written explanatory text. So if you're looking to really learn how to do something and you
want some great examples, coming to Kaggle and surfing through the publicly shared example notebooks is a fantastically valuable place to start. We also have a wide variety of learning courses for folks who are just ramping up and getting their feet wet. I think it's important that we provide those on-ramps so that we can share machine learning knowledge as widely as we possibly can. Lukas: So how do you think about the success of Kaggle? Do you look at it like a consumer website, where you're trying to increase weekly active users or something like that? Are you trying to make money with it, or something else? How do you think about that? D. Sculley: I think that Kaggle is basically the rainforest of machine learning. It's this incredibly rich, incredibly valuable ecosystem that the world absolutely needs and probably can't get by without. There's not a direct revenue model, and I'm not super afraid about that, in the same way that I'm not super worried when companies have a very large research wing or things like that that might not be directly revenue-generating. I think the knowledge Kaggle is generating for the world, the value Kaggle creates for the world, is so valuable that we can make a very strong case that this just needs to exist. And as a team we're pretty scrappy. It's amazing that we've crossed a 10 million user threshold with a team of 50; it's not a huge operation. The work that folks do, from the notebooks team to the datasets team to the folks creating learning content to our competitions team: these folks all work really hard, they're amazing people, and they have an incredibly large influence across the world for what they're doing. So in terms of how I think about Kaggle, I think about Kaggle as an ecosystem. This ecosystem has a bunch of different parts that interact with each other. We have folks who come to us as novice learners. We have folks who come to us as practitioners; maybe they're already doing machine learning on a daily basis as part of their job, or maybe they're quite advanced in their studies and hoping to be doing machine learning on a daily basis very soon. And we have cutting-edge researchers. Geoff Hinton was a famous early winner of one of our competitions, and we have large engagement from cutting-edge researchers. They bring different things to our community, and they enrich the community for each other. Without the novice learners, I think we would lose a ton of enthusiastic energy and the kind of stress-testing that keeps us on our toes. Without the practitioners, I think we'd lose a lot of the real practical know-how and knowledge that gets shared with the community really wonderfully. Without the cutting-edge researchers, we probably wouldn't have anywhere near as interesting a variety of competitions being hosted, or the real next-generation solutions coming down the pike. And of course, as you say, competitions aren't all we're about. If we don't have the notebooks, I think we lose a lot. If we don't have the datasets, I think we lose a lot. So these things play together in a sort
of interconnected web of machine learning in a really interesting way, and I think that treating Kaggle as a valuable ecosystem, and taking an ecosystem viewpoint when evaluating whether we're doing a good job, is the right thing. Lukas: But then how do you measure the ecosystem? Is it by usage? What is your one magic metric? How do you measure an ecosystem's health, I guess? D. Sculley: Yep, absolutely. That is something I typed into Google in week two of the job: how do people who study ecosystems measure health? And it is absolutely a thing that requires very careful analysis. When you talk to an ecologist about how they measure ecosystems, they'll tell you: look, we can't just measure whether the butterflies are happy, and we can't just measure whether the birds are happy. We actually have to have useful metrics on each of the different segments. So we've got a usefully defined grid of metrics, which I'm not going to go into here, that help us look at each of the different segments that we care a lot about and think need to be healthy. But what we're really looking for in the end is not being great in one area and terrible in a bunch of other areas, but to have what we call a green flush: being very good across all the different important areas of our ecosystem. Lukas: So these are like watching people doing behaviors that make you think they're happy and successful in what they're trying to do? D. Sculley: Well, watching people's behavior sounds creepy, and we don't do that. But yes, things like looking at how many notebooks are being created on a daily basis, competition participation, survey responses to make sure that our folks are happy, looking at the bug reports that are coming in, and looking at long-term metrics like the number of papers citing Kaggle in one form or another; last I checked there were almost 50,000 of them. So there's a wide range of ways we can assess whether we're doing a good job. Lukas: Do you have new things that you want to try, or things that you want to change? Are there new people that you'd like to introduce Kaggle to, or new ways that you'd like Kaggle to support existing people? D. Sculley: You asked about this a little bit tangentially earlier: given my background, I think it would be pretty surprising if we didn't push towards some more production-grade, MLOps-style pieces on Kaggle over time, and some of those will certainly be competitions. Judging a model only on the basis of its accuracy is probably not sufficient for everybody's needs in 2022, so we need to provide ways to help folks evaluate models on other dimensions, including efficiency, and then also create useful and compelling and interesting challenges. I think there's a lot we can do in the world of benchmarking. Right now our main benchmarks are really competitions, but given that we have datasets and we have notebooks, I think we can move towards hosting much more long-running benchmarks and be a repository in service to the
community in that way. In terms of our user groups and populations, we have a really strong emphasis right now on outreach to underrepresented populations in machine learning, and that's going to continue for sure. And when I look at levels of expertise in our community, I think we're doing a pretty good job right now of serving novice learners; as you say, almost everybody who learns machine learning comes to Kaggle at some point in their journey. So we want to make sure we're continuing to serve those folks really well, providing as many on-ramps as we can, and making that experience a really good and really beneficial one. But I think we're doing well there, and we can really improve on how we're serving practitioners and engaging the more cutting-edge research parts of the world as well. Lukas: Do you think there's any downside to the competition framing of Kaggle for someone getting started? It's funny how friendly the community is, given that what people are supposedly doing is competing with each other. Do you ever think about that, that some people might not want to compete with other people for the most accurate model or something? D. Sculley: Yeah, absolutely. I've got two responses to that. One is that we've got our featured competitions, where people might be aiming to win a prize of a lot of money or something like that, and there many of the competitors are trying to win, whether that's winning the prize, or winning a gold medal in our progression system, or becoming a Kaggle Master or Grandmaster. Those are really great and important things to be pushing forward. We have other competitions, called playground competitions, that are designed much more to be an on-ramp: less about winning a prize and more about testing your skills. But even for the featured competitions, consider this. One of my hobbies is that I'm an amateur marathoner, and I like to run marathons. It's a wonderful, fun thing to do. You get out there, people are cheering and clapping, and that's true pretty much no matter where you are in the race; spoiler alert, I'm not at the front. So I think there is something about an environment that is framed around a competition but can still be about participation and self-growth, and that is really important and really inspiring to a lot of people, and something we can make sure to emphasize as part of the Kaggle experience. It's really important, and we hear our users telling us this, that lots of people are coming not necessarily to see if they're going to be first or second, but to improve their skills, to share knowledge and ideas, and to learn. Lukas: You were most recently at Google Brain, and I think about the work coming out of OpenAI, famously, and other places, where you get these huge models that on certain axes seem to really outperform other models. I wonder: if you roll that trend forward 10 years, does Kaggle stay relevant? Is there still a role to play for someone who
doesn't have access to a massive amount of compute resources, to solve a problem in a useful way? D. Sculley: So this is a great question. Obviously, what's gone on in the last couple of years in terms of truly large-scale language models, or other multimodal models, has changed the world in a couple of ways. One of them is that it's changed how some research is being conducted, and I think the world of high-energy physics is a useful parallel. There are some kinds of physics (I'm not a physicist, so I'm just going to say some kinds of physics) that can only be done with something that looks like a linear accelerator, where you need to get a couple of billion dollars from a government and build a several-kilometer-long concrete tunnel under some hopefully stable part of the world, so that you can run these incredibly expensive experiments to gain certain kinds of knowledge. This has definitely changed the way some parts of the field of physics work; there's no question about it. Among other things, the world of physics had to get good at doing this kind of research, and to have, in some places, a little bit more of a hierarchy around how experiments get proposed and how they get evaluated, not on their results, but on whether they should be run at all, what gets into the pipeline, who makes those calls, and things like that. I think we're seeing very similar developments for some kinds of machine learning research. But there's plenty of physics you can do in the world, as far as I understand it, that doesn't involve having access to a supercollider or things like that. And similarly, I believe there is, and will continue to be, a lot of machine learning that doesn't rely on having access to collider-scale resources. That can look like many things. What do we do for resource-constrained environments, with models that need to run in the browser, on devices, or on distributed edge-based things? My guess is that we probably don't need collider-scale resources to train tiny models. What do we do for models that need to be fine-tuned in one form or another, or even things like prompt tuning, where we might have a very large-scale model at our disposal but need to figure out how to use it as effectively as possible for a given use case? That's something I think will be reasonable to attempt for lots of people in specialized domains for a very long time, at least as far as I can see forward. The last thing I'll say here is that it's also useful to think about standards of evidence and verification for these very large-scale models. We talked earlier about the kinds of verification, the moral equivalent of unit tests, that might need to be put into place, and I can't think of too many better resources than a community like Kaggle's to attack the problem of how we verify a model that is very, very large scale, a model that might have billions of behaviors that need to be exhibited in different kinds of circumstances. The stress tests
to validate models: can those be framed in terms of competitions, resources, and other things like that? Absolutely. So I think the Kaggle community will be increasingly relevant over time for these reasons. Now, that doesn't mean that every Kaggler is going to train a model with X million compute hours or things like that; that's probably not realistic, and probably wouldn't be good for the world if it was. But I think there's a lot we can do that will still add value. Lukas: I guess along those lines, do you feel like AutoML techniques could displace the value of actual competitions? I feel like in the past the winning Kaggle strategy was typically to do the best feature engineering, and actually, I wonder if that's still the case. Then in these worlds where you have these gigantic models that are sort of doing their own feature engineering, which is one way to look at it, and then AutoML on top of that, what is a Kaggler to do in 10 years? D. Sculley: Yeah, exactly. Look, AutoML is a really important tool, in the same way that hyperparameter sweeps, just to take an example at random, are a really important tool. I believe that AutoML and useful hyperparameter tuning engines and things like this do a great job of automating the kinds of work that isn't particularly interesting in machine learning. In the early days, I spent a lot of time being a manual hyperparameter tuner, and it wasn't that rewarding. But the more fundamental questions remain: what data should be going into a model to train it for a given task; how we should be thinking about data distributions and structures; what the right structures are for a model to capture useful causal concepts, in addition to just learning from whatever correlations are available; and even deeper questions like, if we're doing fine-tuning of a large pre-trained model, what the right way to set that up is, how we create the right sets of targets, and how we choose the right pre-training base to begin with. All of those are interesting questions that I don't think an AutoML pipeline is likely to solve exhaustively, in place of human judgment, in the foreseeable future. So I'm very happy for humans to focus on human problems, the places where human judgment and insight are going to be most valuable; and where there's drudgery, let's automate it, no problem with that. Lukas: Well, thank you so much. We always end with two questions, and I want to make sure that I get them in. The second-to-last question is pretty open-ended, but I'm curious what you think is an underrated aspect of machine learning, or something that, if you had more time, you'd like to spend some time looking into. D. Sculley: I think the thing that is most interesting in machine learning right now is making machine learning robust to shifting data distributions. This is where a lot of my work was in my last couple of years at Google Brain. As we talked about at the beginning, when you break that IID assumption between test and train data, many of the theoretical guarantees that underlie supervised machine learning go away, but we still need things to work. So I think this is absolutely the most interesting area for current work: figuring out
ways to be robust to shifting data distributions. And this isn't some weird abstract problem; it's something that happens for every deployed system I've ever seen. It also happens for things like machine learning for scientific discovery: if you're going to use machine learning to guide, say, protein design or drug discovery or any other sort of generative process, then by definition you're going to be moving out from your world of known things, because that's the point. So how do we make sure our models hold up well in those unknown areas that are so important for advancing key problem areas like drug discovery? That's really one of the most important areas, as far as I can tell. Lukas: Do you have a favorite paper on the topic that we could point folks to, or resources to learn more about that? D. Sculley: We just put a paper out, the last paper I was involved in at Brain, called Plex, which takes a unified view of robustness to dataset shift, starting with pre-training and then augmenting with a bunch of other Bayesian methods, with many, many excellent co-authors, including Jasper Snoek and Dustin Tran, and apologies to the others I'm not naming. Lukas: Awesome. And I guess the final question is: when you think about actually making machine learning models really work in the real world today, in 2022, where do you see the biggest gap, or the hardest part of going from a Kaggle-winning model to something deployed and useful for someone in the world? D. Sculley: I think what's interesting is that people like you have put a lot of infrastructure in place that makes things that used to be quite difficult pretty straightforward now. For the challenge of how to get a model into production, there are plenty of packages, systems, platforms, cloud-based solutions, you name it, that can help people do that. I think the pieces that are more difficult to solve are really about how you make sure that model is one you're proud of over a period of time. That most obviously comes to a head in terms of robustness, which might be robustness to dataset shift, or fairness, or inclusivity, or things of these forms. Making sure our models act the way we want them to in a wide variety of deployment situations is currently, I think, much more difficult than the mechanics of getting a model into production, because of the work that's been done on infrastructure in so many different areas. Lukas: Thank you so much, this was a really fun interview. I really appreciate it; I really enjoyed it. D. Sculley: Thanks so much. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So check it out.",10917
+"Emad Mostaque — Stable Diffusion, Stability AI, and What’s Next",https://www.youtube.com/watch?v=bG5hTokyh5Q,4229,2022-11-15,"Emad: We have to decide what should be open and a public good. This is not from a business perspective, but from a societal perspective, is what should be closed? 
Should the tools to allow anyone to be creative, anyone to be educated, and other things like that be run by private companies? Probably not. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Emad Mostaque is the CEO and cofounder of Stability AI, which is one of the most exciting companies in the AI space right now. Before that, he was a hedge fund manager, and before that he was an engineer and analyst. This is a super fun interview, I hope you enjoy it. Lukas: Alright, do you mind if I just go “rapid-fire” questions? Emad: Yeah, sure. Go for it. Lukas: All right. Emad: Good to see you, Lukas. Lukas: Good to see you, Emad. Well, I think we need to start with defining, in your words, Stability. I think everyone probably has heard of it, but everyone seems to have a slightly different impression of exactly what the company is and what it does. So let’s hear from the source directly. Emad: Yeah, so our official mission at Stability is to build the foundation to activate humanity’s potential with a motto of, “Let’s make people happier.” Stability was basically set up in the belief that these new models that we have — these transformer-based models and similar — are essential for basically unlocking people’s potential in some of the most powerful tech that we’ve seen, and the belief that having them open source so people could build on them and use them was not only a great business model but essential for closing the digital divide and getting this out as widely as possible. So, we basically catalyzed the building of open source AI models, and then we take those models and we scale and customize them for customers. And that’s what we do. Lukas: How did you get started with this? Your background isn’t originally in AI — or is it? Emad: So, actually, I started my career in math and computer science at Oxford. I was an enterprise developer in my gap year. Then I did hedge fund managing for many years. So I was a huge AI and video game investor. But then I took a break when my son was diagnosed with autism, and I used AI to do drug discovery. So biomolecular pathway analysis of neurotransmitters and literature review to repurpose drugs to help ameliorate some of the symptoms, while advising a bunch of hedge funds, governments, and others on AI and tech and geopolitics, et cetera. Going through that experience — that was about 12 years ago that I started that — it was super interesting. And then we saw that a lot of the technologies were evolving, but not until the last few years has it retaken off, obviously. So, I went back to running a hedge fund after that, and it was fine. And then a couple of years ago I was one of the lead architects of CAIAC, which was Collective and Augmented Intelligence Against COVID-19, which launched in Stanford in July of 2020 to take the world’s COVID knowledge and then use AI to compress it down and make it useful. That’s when I first really got exposed to, again, some of these new types of models. I was like, “Holy crap, this is huge. And they’re getting good enough, fast enough and soon cheap enough to go everywhere.” And, “Does it make sense that all this tech that’s so amazingly powerful is going to be controlled by big companies and they believe their edge is that?” Not really. Let’s go away from that. So, I’ve got some AI experience and others, but mostly what I do is see big pictures and big patterns and then put them together. A bit of mechanism design, as it were. 
Lukas: That’s cool. You’ve had such a meteoric ascension in the collective consciousness. I’m curious, has it happened exactly how you drew it up or has it been surprising? Like, when you started the company, what were you thinking? Because it wasn’t even that long ago, right? And then how has it unfolded differently than what you’d expected? Emad: Yeah, so we had the idea of Stability three years ago. The first thing my cofounder and I did was we took the Global XPRIZE for Learning, which is a $15 million prize, for the first app that could teach literacy and numeracy without internet. That was bought by Elon Musk and Tony Robbins. We were deploying tablets into refugee camps, saying, “What happens if we use AI to make this better and more powerful?” We didn’t use AI yet, but we just finished our RCTs showing literacy and numeracy in 13 months of education on one hour a day of being taught, for refugees in camps. Lukas: Wow. Emad: There’ll be some big announcements about the “AI-ification” of that next year. But that was, like, a fuzzy one. Then, we set up Stability properly two years ago to do the United Nations-backed AI work on COVID-19 and fell into a lot of bureaucracy and other things. We really kicked off properly literally a year ago. I think nobody at that time would have expected that it would have gone like this. Like, originally, we helped support the communities at Eleuther and LAION and others. And the thinking was, like, this is a web3 DAO of DAOs. Like, “Let’s reward all the community members and get them together.” But then, after a month or so, we realized that commercial open source software of scale and service was the way. And, while I was funding the entire open source art space, I thought it would be at least until next year that we’ve got anywhere near the quality that we’ve seen now. So I think there’s that pace of compression of knowledge and the ease of use and being able to get it onto people’s devices. But that surprised me, because I thought it would be another couple of years, at least, before we got there. But, I think that’s been the main catalyst, right — Stable Diffusion being the first model that is good enough, fast enough, and cheap enough that anyone can run. Like, it’s a two-gigabyte file from a hundred thousand gigabytes of data. That was the insane thing that I think has allowed it to go off massively. Lukas: Is it an accident that the name Stable Diffusion and Stability AI are connected like that? Emad: Well, so, this is an interesting thing. What’s the actual role of Stability? We’ve got over a hundred people. We’ve got some amazing researchers. But our role is a catalyst in the community, right? So with Stable Diffusion, it built on the work of the CompVis lab, formerly at the University of Heidelberg and now LMU Munich, under Björn Ommer. And so the two lead authors of Stable Diffusion were Patrick Esser at Runway ML and then Robin Rombach who works with us. They came up with the name all themselves, because — we provide computer infrastructure support, and then obviously [??] themselves there — but we always try to give developers lots of flexibility, especially when working in these collaborations. It does get complicated there. We can discuss that a bit later. And they came up with it, and I was like, “Yeah, I love that name. Go for it.” But at the same time, there is this inherent tension, because a lot of people want us to manage the whole community, but that’s not how open source works, right? 
The whole thing about open source is that there’s lots of different things, even if you’ve got, like, Linux or Red Hat or something like that. And for models, it’s also a bit different. Because with normal open source software, you have loads and loads of contributors. Like hundreds. Thousands. You don’t really have that for models. You can do the whole thing just with a team of two to ten people. Or, if you’re like lucidrains, you do that all by yourself. He’s one of the developers that we support. He just cranks out models every day. If you’re a programmer that wants to feel bad, go and look at github/lucidrains for productivity. Lukas: All right, I’ll put a link in, but I don’t want to look at it right now. I’ve been very unproductive over the last few years. Emad: Yeah, it’ll make you feel terrible. Like, “Ah, jeez.” Lukas: I was curious about exactly, like, how that interaction works today with people building models. What’s your way of working with folks? Emad: So yeah, I think that the best way is always collaboration. We have our supercomputer cluster here. It was four thousand A100s originally. Now it’s going much, much larger, because I view that as a key unlock, and then the infrastructure support to make stuff usable there. We had the communities that were spinning out into independent foundations like Eleuther and others where we provide employment and benefits and equity, et cetera. And then collaborations with academia and non-academia independent researchers. I think the goal for the open source side of things is to put a lot more structure around that. So everyone knows when stuff is meant to be released, what happens if you’ve got ethical concerns, and things like that. But again, really be a catalyst for the community. Some of the models you’ll see released over the next period are entirely Stability models. Some of them are combination models, but we want to make sure that these things are clearly defined because otherwise people get sad. And it’s understandable, as well — attribution should be given. One of the unique things that we have brought in, though, is that we’re building an entire infrastructure to be able to scale and train these models. And if we do inference on any open source model, we actually put aside 10% of the revenue from that for the developers. So 5% goes into a community pool that will be activating in a month or two, where every developer affiliated with Stability can vote to allocate to the coolest research they can find. And half of it goes to the developers themselves, even if they don’t work at Stability. So again, we’re really trying to give back a bit to the community and recognize the authors’ things — and they can donate it or whatever — from that angle, and trying to make it so it’s clear how we interact with these. Because we are the fastest providers of compute, support, technical support and input of anyone in the market. You could access super compute before, but it was only really through these giant clusters with, like, 6 to 12-month processes for application, like, from JUWELS — which is pretty good — to Summit — which is much more bureaucratic — and some of the others. And that obviously doesn’t keep pace with the pace of AI development now, which is literally exponential. 
This is why…what happened is that a lot of academics basically had to leave, to either their own start-ups — which, as you and I both know as CEOs, is incredibly difficult — or join a big tech company, which isn’t so much of an option anymore given the freeze that’s going on. And that was it. And then, that doesn’t fit with academia. So academia is one area that we’re supporting in general. And again, I think compute is the key unlock there. But over time, it’s going to be increasing the infrastructure side of things and having standardized stuff. Like, right now, not everyone uses excellent tools like Weights & Biases, for example, to track their runs. We would like to move to more and more open runs so you can actually see how they’re doing, like BLOOM did with their updates, et cetera. So there’s a lot of work to go, but we’re trying to be as collaborative as possible. Lukas: Say I’m a researcher and I have an interesting area of work, and I’m looking for infrastructure support. How do I apply to Stability, and how would you view my application? Like, what would you consider? How would you decide whether or not to fund it and how much to fund it? Emad: So the way that we do it at the moment is that if you’re an active member of any of the communities — from HarmonAI for music, to Eleuther for language models, to LAION for images — you’re most likely to get compute that way. And that can be from an A100 up to 500 A100s, depending on how good your thing is, particularly if you bring in members of that community as your team. That’s the primary way. Right now, we’re setting up a grant-making portal, and we’re working with certain universities in that regard, but then also trying to figure out how we do, like, large clouds of almost “Google Colab on steroids” to allow people to unlock things from day one. This fits in, as well, with the next stage of our program, which is that we funded a handful of PhDs so far who’ve been active members of the community. We’re planning to fund 100 in the next year. And they will come with dedicated compute support for their labs and their projects, as well. And there’s an independent board being set out for deciding that because, again, one of the tensions is always going to be our business side versus the broader side. Like, why are we funding OpenBioML? Because it’s useful. There’s no business logic to it at the moment. But we want to keep that mix of supporting the entire ecosystem so we have a nice place in it and then focusing on some of the business stuff, which is generative media at the moment. So I’d say for the moment, generative media, if there’s anything interesting you can just reach out on the communities, and we fund most things in there. The other stuff, we’re building up the infrastructure, but just join those communities. Join the OpenBioML and other communities and contribute. That’s the best interview of all, right? You’re more likely to help people who help your communities. Lukas: And then, like, what’s required of me? Say I’m someone with a new idea for generating, like, awesome music. Does that mean that I need to contribute my model to the community after it’s done training? Emad: No. We encourage open source, but a large part of it is open access, as well. Like, we have incubator arm coming, a VC arm, and others for those who don’t want to go open source, but we heavily encourage open source. I think not everything needs to be open source. What needs to be open source is the benchmark models. 
It’s like, “leave nobody behind,” but the reality is that open source will always lag closed source. Midjourney just released version four, which is amazing, right? And DALL-E 3 will come out soon, which will be even more amazing. Why? Because they can take open source basis and go ahead, or they can just do something different. So, Midjourney version four was completely different, but Midjourney version three with Stable Diffusion was a mixture of the two. So you will always get this iterating, where open source will lag behind. We’re just trying to make it so the lag is minimal and people start on that same basis. But for people who come and use our cluster, the priority for the first cluster is open source, but we’re going to have more clusters where they will also be for the companies that we’re incubating, our own use, and other things like that. Yeah. Lukas: How do you think about the sort of broad buckets? It sounds like you do it by use case. It seems like you’re good at recognizing larger scale patterns. Do you have an opinion between the value of investing in infrastructure for audio generation, image generation, these large language models? How do you even approach that question of allocation? Emad: Right now, I would say, from a business perspective, media is by far the most lucrative, and that can fund a lot of other stuff. So Google and Meta have amazing research labs that they fund through advertising. That’s basically… we all hate advertising. Advertising is manipulative, and particularly with these new models has become even more manipulative. The area that we’ve focused on is the world’s content. So audio, video and others, those will all be in foundation models in the next five to 10 years, and we're focusing on that to fund everything else. I think that’s a reasonable model because the Disneys and Paramounts of the world will eventually have to transform their entire archives. Like the VHS to DVD uplift on steroids, because you know how difficult doing these models is. So that’s our core focus from a business perspective. From an impact perspective, it’s not more difficult. This is also why, like, one of the things we’ve done now is, again, within a year, we’ve built this giant cluster. So 4,000 A100s isn’t the largest private cluster. But on the public top 500 list, it’s in the top 10 probably. Like, JUWELS Booster with 3,744 is number 11. The fastest supercomputer in the UK, Cambridge-1, is 640. The same with Narval in Canada, for example, and NASA’s got about the same. So this is a chunky old beast. The reality is that should be a public good eventually, and there is a national research cloud discussion led by Stanford and a bunch of others that say this is needed for US universities. I think it’s needed for international universities. And so hopefully we can figure out a way to transfer over there with this value function that you’re discussing because otherwise it turns into fiefdoms. Right now it’s quite a centralized thing, where we’re just like, “What can be most beneficial for the community and attracting assets to the community?” And this was media for us. We’re still doing the LM training, but large language models, I think, are less impactful, because language was already 80% there and we’ve gone to 90% there. Whereas a lot of this image stuff was like 10% there and suddenly we’ve gone to 80 and now 90% there, and so it’s a lot more immediate for people. 
This brings us to the final bit, which is that the nature of these models — and the data that they run on — is that they can do just about anything. So if you have them converging in terms of quality from different players and then an open source version, where’s the value? The value can’t be in models if they can do anything, right? The value has to be elsewhere. And so that’s going to be very interesting to see, especially from, like, say, the societal value versus business value. Lukas: But what’s interesting is, as far as I could tell, the main thing that you’re doing — the thing that you’re really passionate about — is democratizing access to creating and opening up these models. So if the value isn’t there in your mind, how do you think about creating a long-term sustainable business? Emad: Basically, the value is in going into Hello Kitty as a business and transforming all their assets into interactive ones — it can be for the metaverse, it can be for new experiences, it can be for wherever — and then building tools to enable them to access their models and other people to access their models, piping it around the world. Our main play as a business is basically content and helping big companies with that and then helping everyone else through the software that we built. Like, DreamStudio Lite is just a very basic piece of software. DreamStudio Pro — that’s going to be released in late November — is a fully functional animation suite with storyboarding, and fine-tuning capabilities, and ability to create your own models, and other things like that. So again, being in that infrastructure layer and allowing the infrastructure to be usable is where we’re at. Plus, of course, our APIs, which are industrial scale. We’re negotiating the cost down and down and down because the data on how models are used, as many people on this call will know, is as useful as the models themselves. Because then you can instruct them, and you can guide them down. The Carper team led by Louis has done an exceptional job in releasing the first open source instruct model framework. And now we’re training new models to be able to instruct them across modalities, as well, based on some of this data. So, I think that’s where the sustainable edge is: a mixture of content and mixture of experience. And the content, to give you an example: we have a deal with Eros in Bollywood in India, which is the Netflix of India, 200 million daily active users. All the Bollywood assets are going to be converted by us. And then all the music will pretty much sound the same, but it’s like… that data will eventually be converted, we’re just doing it five years before anyone else otherwise would have. Lukas: Sorry, when you say “converted” — converted into what? Emad: So, you take all the Bollywood music, and then you have a text-conditioned audio model that can generate any Bollywood music. And that doesn’t need to be open source as a business thing. But then we can use the open source Dance Diffusion models and the new text-conditioned ones we’re working on to be the framework for that. It’s like, you go and you do a MySQL database with someone, and they load their data into it, right? And they’re like, “Okay, well, I’m paying you to implement this” because that’s MySQL’s model, or PostgreSQL’s model or any of these other open source database providers or service providers. And that commercial software model is very well established. 
There’s an extra wrinkle in this, in that they load their data into a model that then converts it into a couple of gigabytes that they can then use for their internal processes and then external things. The extra wrinkle is that it’s hard. It’s hard to train these models. Even to fine tune these models isn’t that easy. We’ll make it easier. And the pace of model development means that they have to retrain every so often, as well. Until, I think, in image, you get to a steady state in two years. In video, probably three to four years. Audio is probably about two years, as well, for having a standard model in the space. Lukas: But what are you doing for the Bollywood application today? What’s the conversion that’s happening? Emad: Oh, it’s just like interest, right? Like, Bollywood is just… well, we can’t discuss it because we haven’t announced it, but basically, it’s more TikTok-type stuff and Snapchat-type stuff. And the things that you’ve seen with the use of Stable Diffusion right now is image-based, quite static — but it’s inevitable that entire audio and movies will be created using this technology. Not zero-shot, but in a pipeline of different things. This is what we’ve seen with, like… you take EbSynth, Koe and Stable Diffusion, and you can map a monster onto your face with the full thing. That’s the type of thing that we’re thinking around this. Most of the Bollywood stuff is going to be used internally now to save costs on production. And then over the next few years, you will see it go from cost to product savings to new revenue streams as people have new and interactive experiences across modalities. Lukas: So when you think about your own internal allocation of resources — humans, right — you have about a hundred people, you said? Is that right? Emad: Yeah. Lukas: How do you break down, like, who works on the foundation models versus who works on the commercialization? Or is that even the right way to think about what you’re doing? Emad: We split it into two, basically, whereby the researchers — who are open source researchers — actually have in their contracts they can open source anything they create unless we specifically agree otherwise. And they’re given a lot of independence and a lot of free reign to make mistakes. So we can say we went overboard on compute, but that’s what allowed us to experiment with different things. And we’ll continue to ramp that up because a lot of researchers are constrained by compute and other resources. So, it’s like one training run and they’re done, or they’ve only got like a 50% buffer or something like that. We thought that was the wrong way to have breakthroughs. Separate from that is the product and deployment teams — like, the customer solutions teams — because we don’t want product to influence research too much. Like, people are aligned in that they want to create a great business so it could be self-sustaining. But when you have product influencing research, you get bad outcomes. So the product team does its own thing. They work closely with the research team. And they have discussions to influence at a high level, but there’s no forcing function. You know? So it’s not like you have to have a model ready by this deadline in order for this product release. Because if you do that, you will never have proper research. So, that’s one of the ways that we split it out, and there’s infrastructure that supports all of them. 
Lukas: Do you worry about someone else coming along and taking your open source models and then building their own rival applications to yours? Emad: I really hope that other people release more open source models! That means that I don’t have to, right? Lukas: True! Emad: Because again, our role is to help grow these communities, and it’s to provide the support for people doing that. So, if someone wants to come along and create their own model, we can provide compute for them. There’s a lot of different entities that we’re providing compute for — who people would see as competitors — because I think this whole market just grows massively. Like, with Midjourney as an example on the art side, I gave a grant for the first A100s for the beta. And I said, when Stable Diffusion launched, they would be better than we are. You know? And it’s fantastic they are. Other people have had issues with quotas and other things. I’ve stepped in to try and help them, even though they might be viewed as competitors on the API. So again, I think the whole market will just grow massively. The key potential displacement point for us is basically another company coming and doing exactly what we do and supporting the community in this very strange way, and being decentralized and having this division. But then it’s like, why wouldn’t you just stick with us? I think our replacement cost is quite high. And the role of our company will change in the coming years. So now we’re a catalyst to make sure and force people to go open, as a forcing function. In a few years’ time, we’ll be more of a services company that is building Indian-level models for the Indians and Filipinos and other things, and for the largest content providers. And then I hope, over time, we move into being an AI platform — just making AI easy and accessible for everyone. Because all the models will be pushed to the edge. I think they’ll get smaller and smaller and smaller, and you’re seeing custom silicon in like an iPhone and all these other architectures, whereby a lot of these models will just be a few hundred megabytes big. And you’ve got your own model, and I’ve got my own model, and we’re interacting with big models in the cloud. I think that’s a really interesting flip of the internet. And that’s what we’re aiming for. I don’t think anyone I’ve ever seen is really doing the same. And even if they are, they might as well join us. We’re cool. We’re fun. Lukas: Overall, it sounds like you think models in the cloud will get bigger and bigger, but there’ll be smaller versions of them for ease of deployment and cost. Is that a fair statement? Emad: No. I think that if you look at the Chinchilla scaling paper, it basically says you want more epochs of training, which actually means better data when you disaggregate it. I think data quality will become essential. I think the models will become relatively small. But then on the edge, they become even smaller. So it’ll be a hybridized experience. Like, when you use the Neural Filters in Photoshop, there’s a bit of processing in the cloud and then the remaining render processing on your computer, right? Kind of this hybridized experience — or Microsoft Flight Simulator — will become quite commonplace for running these models efficiently. But I don’t think that models will continue to scale — like, that we’ll see a trillion-parameter model or something like that. Instead, I think more of an MoE approach — where you have multiple models that are good at various things — will be key to this. 
Like, right now, on the Stable Diffusion example, you’re seeing people using DreamBooth to create a GTA model, an Elden Ring model or something like that. That’s an optimal way, rather than having potentially one model that can do everything. But we’re not quite sure. And DreamBooth maybe isn’t the best way to do it. Maybe it’s hypernetworks or something else. But I think different models for different things, and your own personal model — like, a million models — is the better way than one model that can do everything, even though that’s very attractive because it’s like, yeah, let’s just chuck it in. And we’ve seen this development of skills as you’ve scaled up. So, I think scale was everything. Now, I think data quality will be everything and model usage for instruct models will be everything. And the value is going to shift there, even as compute becomes plentiful to allow for ridiculous scale. Lukas: So you predict a reverse of the current trends of people building bigger and bigger models. You actually think they’re going to start to get smaller. Emad: Well, I mean, like InstructGPT is, at 1.3 billion parameters, as performant as GPT-3, right? Similarly, if you look at FLAN-T5 and some of these other models that Google has released recently, they're the most performant models out there. Because, like, these are big neurons. You don’t need all of that stuff. Similarly, with the compute scarcity, relatively speaking, we just chucked a lot of random data into these things. But if you think of these models, like, a bit like the human brain, what’s better: just a diet of, like, every piece of media out there or just the media that you need? Yeah? And what does that look like for these models? We don’t know yet. Also, it’s moving so quickly that we haven’t been able to keep up. Like, a year ago, if I told you the image models would be like they are now, you’d be like, “No way.” Like, even I can’t believe it, right? And so this opens up a big question. Like, why is an image model… why is Stable Diffusion two gigabytes and 890 million parameters, whereas you’ve got 175 billion parameters of GPT-3? You know? What’s the amount of information they can convey? Does it make sense that text is so much bigger than image? Lukas: I don’t know. I mean, it seems plausible that it’s bigger than image. I mean, my understanding was that these models — at least the language models — generally get better on a broad set of benchmarks as the model size grows. But I mean, certainly other things matter. Emad: No, they do. And again, this has been shown. But then, like I said, the Chinchilla paper showed that they also get better as you train them more, for similar parameters. So, a 67-billion-parameter, five-times-trained model can outperform a 180-billion-parameter model effectively. But then you see other things. Like, with image models it’s the same. Google has a different type of model called Parti, whereby they scaled it to 20 billion parameters, and it learned language and things like that on the way. But, like I said, Stable Diffusion being this performant at just a couple of gigabytes and 890 million parameters makes you question, “What happens if we start optimizing the data?” Because we just chucked in an unfiltered dataset, relatively speaking — some of the bad stuff removed — just two billion images into that. What’s the minimum number of images to have Stable Diffusion quality output? Is it 12 million? The model that [??] 
released in December last year, CC12M — that was used for the original version of Midjourney and a lot of stuff — was only 12 million images. How many images do you need? How much text do you need? And then what effect does that have on the size of the models? I don’t think it’s all about scaling laws anymore. Even as, like I said, the compute becomes available now to scale infinitely. Like, some of the clusters I see being built are insane. Lukas: It’s sort of surprising — it’s interesting — your insight, maybe, as you put it earlier, was that people really needed this massive compute to make it broadly available. But then it’s an interesting contrast to your current prediction that the models will become smaller and more specific. Does that make you have any plans to sort of change resource allocation or the kinds of compute that you want to get ready for researchers? Emad: Yeah, I think we basically don’t need to infinitely scale compute anymore. It becomes, then, about dataset acquisition, and we’re building out a couple-of-dozen-people data team to provide the right data for open source research. I think data quality is underestimated in terms of its importance right now for these models because people are like, “Scale is all you need. Stack more layers.” And it was difficult to build a cluster, even in the thousands of A100s, just because there wasn’t availability. But now, you look at next year. I know of three 20,000 H100 clusters that are being built. An H100 is probably about three times as performant as an A100, so that’s like 60,000 A100s. Like, 15 times bigger than our cluster. They can train a GPT-3 probably in six hours or something like that, one of these clusters. So the compute’s no longer really a bottleneck, but I think what we’ll see, again, is that people will take the standardized models and customize them down and then have a load of different models, and maybe there’ll be one or two more big models. But, I think it’s not about big models anymore. It’s about optimal models, and we don’t know what an optimal foundation model is, across the data, training, and other architectural parameters yet because we’ve been so constrained by compute, data, and talent. And each of those is being unlocked right now. Lukas: That’s really cool. What kinds of datasets are you thinking about building? Emad: So, like, we’re talking to national governments about, like, national broadcaster data. You’ve got really interesting, highly structured things there that are high quality versus crawls of the internet. And these are public goods that should be available, right? Lukas: Sorry, what would that be? I’m not familiar with it. Emad: Well, so, you have PBS in the US, right? Like, their data should be available for model creation for academia, right? Lukas: Oh, I see. So, you would just acquire that dataset or somehow get a license to make it available? Emad: Exactly. To researchers initially, and then hopefully more people because, again, this is public. It’s paid for by the people. So it should be available to the people in various ways. So if you’re training a model on all the PBS radio station work, and they’ve all got, like, transcripts, you could do it in various different ways. You could create synthetic datasets off that. So looking at some of these media datasets has been quite interesting to us. But then in other areas, it’s about more than that. So like OpenBioML, we’re doing the usual protein folding, some DNA stuff, and supporting things there. 
But in bio ML, there's just a lack of quality data. So one of the things we'll probably do soon, we're just deciding on this, is a prize to basically identify what dataset should be built and then bring in external funders to help build those datasets. Protein folding was quite good because there was a great dataset, and there was an objective function of quality. And so people could build around that. So, you have OpenFold, you have [??] that we're doing and other things to make that more and more efficient. Other things in bio ML don't have that. Within the language thing, we're doing the Pile Version 2. The Pile Version 1 from Eleuther was very widely used, and Version 2 is much bigger. With images, we had LAION. The largest image dataset was 100 million images — YFCC100M, which was the Flickr dataset from 2013. LAION did LAION-400M — which is 400 million image-text pairs — last year, and that was used by Google and Meta and a whole bunch of others in their models. That's how good it was, because Google and Meta and others are actually constrained about using their user data because of FCC regulations and other things, weirdly enough. Now they've done LAION-5B, which is 5 billion image-text pairs — actually 5.8 billion — and they're going to go even bigger. So, it's creating these big open source datasets, replacing a lot of the scraped lower-quality stuff with some of this public sector data, encouraging others to contribute to it, and then building great datasets for every modality so that everyone, again, is on the same page. I think we've got to the point now where the communities that we support and our own internal teams are building better datasets, in some cases, than even private companies have access to. Lukas: Yeah. I think one of the disconnects that we see talking to a lot of researchers and companies — of course, there's a lot of overlap in applications, and deep learning is incredibly practical in lots of ways. But I think a lot of companies are looking for more research around time series and structured data. Do you think about investing in that realm at all? Emad: We've had some approaches for time series analysis and things like that. I'm not sure these foundation models are the best things for that, to be honest, because I view them more like principle-based analysis in the brain. Like, with my son — with his autism, ASD — the main thing about that is there's typically a GABA/glutamate imbalance. GABA calms you down, like when you pop a Valium, and glutamate excites you. There's too much noise. And then once you calm down that noise, you do repetitive trial teaching so that you can rebuild things like words, because if there's too much noise you can't learn the connections between concepts and words. Like, a cup is a World Cup, cup your hands, all the different cup meanings. And then you rebuild that. These models are the same, in that they can figure out the latent spaces, or hidden meanings, of connections between different labeled datasets. And with time series and things like that, I'm not sure this is the appropriate thing for that. Again, we're funding a little bit of research in that area. But I think that a lot of the classical ML approaches are better suited to that, because you typically don't do out-of-sample stuff there.
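One concrete data-quality lever behind crawled datasets like LAION, mentioned above, is filtering image-text pairs by CLIP similarity so that captions which do not describe their images get dropped. The sketch below is only illustrative of that idea; the model checkpoint and the 0.3 threshold are assumptions, not LAION's or Stability's exact pipeline.

```python
# Minimal sketch of CLIP-similarity filtering for image-text pairs, in the
# spirit of how crawled datasets get cleaned. Checkpoint and threshold are
# illustrative choices only.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image: Image.Image, caption: str) -> float:
    """Cosine similarity between the CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

def keep_pair(image: Image.Image, caption: str, threshold: float = 0.3) -> bool:
    """Drop pairs whose caption doesn't actually describe the image."""
    return clip_similarity(image, caption) >= threshold
```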
And, like, looking at hedge fund stuff, you are typically inferencing and extrapolating versus trying to do first principles analysis of, like, "What is a Van Gogh painting mixed with a Banksy painting," and these types of things. But, again, I think 80% of research now in AI — I think this is in the AI Index report that was released by Stanford — is in foundation models. So we're one source of funding for this and, again, quite focused around media and language. There's just a whole world of funding around this area. So if it is useful for time series, I'm sure we'll find out sooner rather than later. Or maybe we won't. Maybe they'll just take it, run a hedge fund and be like, "Hahahaha, get all the money!" Lukas: Do you have an opinion on other architectures? Are you seeing anything? I feel like it's amazing, the convergence around transformers and so many different applications. Do you see any signs of that changing, or no? Emad: Potentially there are some promising things that I've seen. You know, you don't necessarily need attention, as some recent papers have shown — I'm trying to remember which ones. And there's some Attention Free Transformers stuff being done with one of the projects that we're supporting around RWKV on the language model side. But I think transformers are probably going to be the primary way of things for the next couple of years, at least, just because they've got momentum and they have talent. And again, the commonality of architectures around this, you're like, "Hey, let's just chuck it at this or that," and you're like, "It works." And we're just scratching the surface. Like — for those who don't know — for images, last year we had… well, the big breakthrough in January/February by Ryan Murdock and Katherine Crowson and some others was to take the open source CLIP model that OpenAI released and pair it with a generative model, VQGAN — that was Robin Rombach who did that one with his team, CompVis — so you have a generative model and a model that maps between images and text, and you bounce them back and forth across each other to guide the output and get more and more coherent stuff. In December, Katherine postulated that CLIP conditioning would be the best way… taking a CLIP model, the language model, and a diffusion generative model and combining them together, and somehow it learned the stuff. Then Google, with the Imagen team, took a language model, T5-XXL — that was a pure language model — and mixed it together with the diffusion model, and somehow it learned how to generate images, and it got even better. Everyone was like, "Wait, what?" We still don't exactly know how these things work, to be honest, and the potential of extending these. So I think transformers have a long way to go. But again, there's a paper — I don't know if you saw it — on the number of papers on arXiv: it's literally an exponential with a 24-month doubling in ML. It's just going crazy everywhere. Who knows what people are going to come up with. The interest in this area compared to basically the rest of the global economy means there will be more and more resources just deployed towards this, because it's finally actually showing usefulness. It's just… where that usefulness and value will lie, nobody really knows. Until then, just take some data and chuck it into the H100s, stir it up, and see what pops out the other side. Lukas: It seems a little surprising that you have this amazing company that does all this cutting edge research in ML and model generation, and the first really big application is generating media.
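For readers curious what the "bouncing them back and forth" that Emad describes above looks like mechanically, the toy sketch below optimizes an image so that CLIP scores it as more similar to a text prompt. Real VQGAN+CLIP or diffusion guidance optimizes a generator's latents and normalizes inputs properly; the prompt, step count, and learning rate here are illustrative only.

```python
# Toy sketch of CLIP guidance: nudge an image so CLIP rates it as more similar
# to a text prompt. Real systems optimize a generator's latents, not raw
# pixels, and apply CLIP's input normalization; this only shows the loop.
import torch
from transformers import CLIPModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a Van Gogh painting mixed with a Banksy painting"
text_inputs = tokenizer([prompt], return_tensors="pt").to(device)
with torch.no_grad():
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Start from noise and ascend the CLIP similarity gradient.
image = torch.rand(1, 3, 224, 224, device=device, requires_grad=True)
opt = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    img_emb = model.get_image_features(pixel_values=image.clamp(0, 1))
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = -(img_emb * text_emb).sum()  # maximize cosine similarity
    opt.zero_grad(); loss.backward(); opt.step()
```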
I never would have thought that a priori. Do you have other areas that you expect to take off or that you’re looking into? Emad: You shouldn’t underestimate media. The easiest way for us to communicate is doing what we’re doing now. We’re having a chat with our words. The next hardest is writing each other emails or chats. To write a really good one is very hard. Like, “I made this message long because I could not spare the effort to make it shorter,” I think someone once said. Lukas: Right. Emad: The hardest thing for us to do is communicate visually as a species. This is why artists are great. PowerPoints, we’ve all been there and stuck there. With the combination of a language model, a vision model, a language generation model, and a code model — you don’t need PowerPoint anymore. You can speak and create beautiful slides every time. With art and visual communication, anyone now… my mom can create memes and send it to me about why I don't call her enough in an instant. Like, humanity can finally communicate both through text now, with these language models — and you’ve seen how things like Copy.ai and Sudowrite and Jasper have made that easier — and now visually, as well. And the next step will be 3D. That’s a change in the way humanity communicates, which is a huge deal. Again, language was valuable, but it was already getting there. You already had help. Your Gmail suggesting to tell him to bugger off in your replies or whatever. Now it’s the next step there, which is image, and then 3D, and things like that will follow. That’s valuable because, again, we have to look at where the money is. The previous iteration of the web was all about AI being used to target you ads. Now it’s about something else, where you’re moving from maybe consumption to creation. My focus has been in this area as a main driver there. But, in terms of impact and global stuff, the ability to switch between structured and unstructured data dynamically at a human level because it understands the principles when combined with, like, retrieval augmentation and other things to check for factual accuracy, it’s such a huge deal because it means that you can write reports, you can do legal stuff, you can get rid of bureaucracy. It’s the first technology that enables so many things because it’s so general that we’re not sure where the value will be. But, I do see the value in anyone being able to express themselves and communicate better. I think that we shouldn’t underestimate that particular aspect of things. Lukas: I also wanted to ask you, and you’ve talked about this a fair amount, but I’d love to hear directly. You made this decision to make all your models really open, in contrast to what OpenAI and others were doing, which I think people got really excited about because it sort of felt like… I think with some of the earlier models, there were these gatekeepers. Like, no one could really access them. Some models, like, really no one except people at the company could access. I remember the reason that some of these models didn’t get opened up was said to be ethical concerns at the time. Do you think that there’s any merit to that argument? Do you think about that at all — like, models being used to trick people, or spam people, or things like that? Emad: Well, I think it’s a valid point of view. Basically, the logic there is similar to the logic of orthodox and ultra-orthodox religions, which say anything that leads to a sin in itself is sinful, and so just in case. 
But it's understandable because these models are so powerful that you move from a risk-minimization framework where you've got an expected utility — "What's the positives? What's the negatives?" And you try to figure that out roughly, right — to minimax regret: "If I release this model and something goes wrong, my company could get blown up." I minimize my maximum regret. And we don't know what it can be used for, because it can be used for anything. However, I think the last few years have shown this: GPT-2, too powerful to release. GPT-Neo and the other things come along. The world hasn't ended. Stable Diffusion has been in the hands of 4chan now for 10 weeks, and all they've basically created is like Cronenbergs that have given themselves nightmares. Like, it's not great at creating these things. The bad guys already have the technology. The nation states… Russia has tens of thousands of A100s, right? And the people can't run them. So they can build it. And we don't have immunity to this new alien technology being out there. Because, ultimately, we live in a society that regulates against stuff. So if you are creating bad things, you'll be regulated against. If you are using it for bad purposes, again, the means of distribution, or the social networks, have rules and regulations in place. Because what you're really trying to regulate is not content. Because bad content is bad. You're trying to regulate behavior, and that's about who's allowed within these communities and not allowed within these communities. And all of this stuff gets mixed up. Then the other aspect of it is this AI safety alignment issue of the technology killing us all. I will say quite clearly, I think that GPT-4, when it comes, will be more dangerous than GPT-4chan. Because a model like GPT-4chan that was trained by Yannic on 4chan and produces just pure rubbish isn't really going to go anywhere. It's just going to produce pure rubbish a bit more easily. Whereas a GPT-4 — which, God knows what it will be, but I'm sure they'll do an amazing piece of work — the large models that they're creating now are getting to human-level. And we don't know how exactly they work. And they're being created by unregulated entities with models that are as powerful as any technology out there. Small models, widely used and with the communities around them regulating them, are not the issue. Big models are the issue. And we should have more oversight on that just in case some of this AI alignment stuff turns out to be correct and these things are dangerous, which I think they probably are. Lukas: But you believe that these small models are also very powerful. So why would the regulations be different for the size of the model? Emad: Oh, because they're not open, right? So when they're open, everyone can check it. So right now everyone's poking around and saying, "Oh, those artists. Are they going to be compensated on LAION and this and that?" And we're like, "Cool. Let's have that discussion in the open space. What's the best mechanism to do this?" We've got a $200,000 deep fake detection prize coming up. We'll give it to the best implementation of open source deep fake detection. It's available for everyone, and everyone can be a part of it. Whereas the big guys, there is no control. Like, again, the example I gave a bit earlier. Imagine that Apple or Amazon or Google or someone integrated emotional text-to-speech into their models, right? So Siri suddenly has a very alluring-type voice and whispers to you that you should be buying stuff.
You'll probably buy more. Is that going to be regulated? It's not currently, and it won't be in time. Whereas putting these models out into the open will get people to think about, "Actually, that's something that probably should be regulated." And if something is regulated, that is fine because it's a democratic process. Whereas companies using this technology to manipulate us — literally, because that's the advertising model — that I don't think is appropriate. And again, it's not just Western concerns about influence and deep fakes and elections and stuff like that, because when you look at that, there is a herd immunity thing (not a COVID-type thing, though we did lots of work on COVID). As people come to understand this technology, they will be more discerning over curated outputs, and then it will be a mixture of that and detection technology. And then, for example, we're part of contentauthenticity.org, where all our future models will have EXIF files — well, special metadata files — showing by default that the outputs are generated. Now, will people choose to use them or not? They may choose not to use them, in which case they won't have a tick next to them, right? So there are all sorts of ways to do this, but the reality is that again, this is a complex debate that cannot be decided basically in San Francisco. It's important because this technology will inevitably be around the world. And if you actually poke people, and you say, "Okay, so you don't want this technology to be used by Indians," they're like, "Well, of course we do!" "When?" "When it's safe to." "Who decides that?" "We do." "So they're not smart enough to decide it?" "No, they need to be educated." And then it gets really bad, right? But again, I think it's understandable because it's scary, and cool, and scary all at the same time. Lukas: Are there any applications currently of the models that you've built that make you uncomfortable that you would like to try to prevent? Emad: There was an example of a DreamBooth model being trained on a specific artist's style. And so it was an artist with a cute, Teen Titans-type style, and it was announced and released as that artist's model. But they had nothing to do with it. I felt uncomfortable with that because I don't think that styles can be copyrighted, but it was, like, almost this co-opting of the name of that artist to do this. Like, eventually it got changed after discussion. There was a piece about that. We're entering some of these gray areas where we have to decide these things, and we have to figure out things like attribution mechanisms and other stuff. DeepFaceLab has existed for years now. It has 35,000 GitHub stars for doing deep fakes at high quality. Maybe with this technology it's a bit easier to use, but that's the inevitable pace of it. I think we have to figure out some of the things around attribution, around giving back, and around making sure that people's things are used appropriately, right? Because — in general, with attribution and copyright and things like this — these models do not create replicas when they're doing the training, if you look at how a diffusion model works in particular. They just learn principles. Again, styles cannot be copyrighted, so it's very difficult to do that. But when it comes down to the individual basis, I'm still struggling a bit with how we prevent that from happening, and people co-opting other people's things, other than in a court of law. Is there any automated system? Because you have the ethical, moral, and legal.
Community typically enforces moral. Ethical is a more individual thing, and we have a creative open air license for that. And legal is obviously a whole other thing. We don’t want things to get down to legal. It’s like, how can you encourage community norms? I’d say that’s probably the primary one here that just made me a bit uneasy. Lukas: I see, interesting. Do you do any — like, in your APIs that you offer — do you put restrictions in there that you don’t have in the open source models used just directly? Emad: No, 100%. So again, it’s regional-specific and it’s general, and it’s very safe for work, shall we say, because again — it’s a private implementation of an open API. Even with the models… like, Stable Diffusion ships with a safety filter that’s primarily pornographic/nudity based, just in case you’ve got an output that you didn’t like. Like, the new versions of it will be more accurate to reflecting what you want, and again, trained on potentially safer datasets, etc. But there’s obviously a different bar for a private implementation. Again, our basic thing is that these models should be released open as benchmark models with safety around it. So, like I said, there was a safety filter. If you trip the safety filter in the open source version, it shows you a picture of Rick Astley, and you can adjust the safety filter or you can turn it off. And then there’s an ethical use license. Any other suggestions for improvements there, we’d love to know. And again, I expect that this technology will proliferate, because we catalyzed it. There were the contributions from Patrick at Runway, from LMU CompVis team, and others, and it was led by those two developers. There’ll be a variety of models of different types being created by a variety of entities. And some of it will be safe for work, some of it will be not safe for work, but I think we need to try and figure out some standardization norms around this as this technology starts to proliferate. But again, that should be a communal process. Lukas: You know, you keep mentioning these communal processes. And I’m curious: what happens when the community has, like, deep disagreement with itself? I imagine that happens all the time. Like, how do you resolve a community where people might have really different senses of what’s moral and draw lines in different places? Has that happened yet in your community? And how do you expect to… Emad: 100%. It happened in the wake of the Stable Diffusion release. People were like, “This can be used for not-safe-for-work, and we don’t feel comfortable with that and supporting that internally within Stability.” And so we had a discussion as a team, and we decided not to release any more not-safe-for-work models as Stability itself. Some people weren’t happy with that. Most people were fine with that, but that was easier because it was a team decision. On a community basis, that comes under governance structures. So right now, one of the things we’re doing is we’re looking at a EleutherAI, and we want to spin that out into an independent community, because it’s got lots of different entities and lots of different points of view. What is the appropriate governance structure with it? Is it Linux Foundation, PyTorch? It has a lot of OSS things. It’s a bit different because these technologies are not like… what can you do with Linux, really, right? Lukas: Yeah, exactly. Emad: Whereas, what can you do with the most advanced language model in the world? 
It's a lot more complicated and needs a lot more voices there, and that's why we're taking some time just trying to say, this is the governance structure on day one. But we need to make it adaptive because we're not sure exactly where this stuff will go. Right now, we as Stability have a lot of control over GPU access and a lot of this stuff. It's the spice. That shouldn't be the case going forward, because no one entity — whether it's us, OpenAI, DeepMind or another — should have control over this technology that's a common good. So, again, we want to be contributors to, like, an independent not-for-profit, as it were, as opposed to controlling this technology, and then play our part in supporting and boosting it as open source. I think eventually what will happen is if people really disagree, they'll just fork. We've seen that in various communities. Just fork it, right? It's the beauty of open source. Lukas: Yeah. Emad: And you can go and do your own thing. Lukas: Although I imagine it might be easier to fork a model because one or two people could, like, take it in a different direction. Emad: Yeah. I mean, you can fine tune models. You can fork models. I think the key thing here is the benchmark model. That's a lot of compute up front, right? And then fine tuning and running it is relatively little compute. This is the opposite of the current paradigm of Google or Facebook, which is relatively little compute to get it into a database structure, and most of the compute is spent at inference time. So you can take a Stable Diffusion model right now and you can train it on your face with 10 images or 100 images and then boom, you've got your own, like, Lukas model that can put Lukas in anything, right? Lukas: Yeah, that's super cool. Emad: That's a flipping of the entire paradigm. But that isn't a forking of the community. A community fork will come from disagreements over safe-for-work versus not-safe-for-work, over whether the datasets are crawled or licensed, or things like that. And I imagine we will see different communities around this, around some of these key questions. Lukas: Although what's tricky maybe about this, and a little different than other communities, is you're holding this very valuable resource in terms of compute. So at the end of the day, you will have to arbitrate more aggressively, maybe. Like, for sure, anyone could easily fork stuff, but then they would have to potentially ask you to get the compute resources to really make a meaningful fork, right? Emad: Yeah, 100%. Right now we have a lot of control, because we're the fastest supplier of compute. But a part of what we're trying to do as we spin these off independently is make it so they can access their own compute and also stimulate some of these national clusters to be more open. So it doesn't take six to 12 months to get A100 or H100 access anymore. I think, again, it deserves to be a bit more diverse. So multiple parties at the table as opposed to centralized. And this is a deliberate action by us to move towards more and more distribution and decentralization, both from an ethical and moral perspective. But then, also, like I said, from a business perspective, it works for us as well. Because if we're considered to be in control of everything, we don't know what's going to happen there. And it's really a lot of effort to coordinate an entire community, and it likely won't be positive, because it's going to be a lot if this goes to 100 million, a billion people, as we expect. Coordinating all of those.
Instead, it should be an independent entity doing that where all the voices can be heard. And we've got our own part to play within that. So, we go from being the main provider of compute, to being a provider of compute, to — hopefully — the compute being provided by the world at large to do this properly, because it is a public good. And that's good for us because it saves our costs, right? The open source models get created without cost to us. Lukas: So you imagine a world where a huge fraction of the world's population is training models. Did I understand that right? Emad: No. I think everyone in the world will use these models. I reckon there will be, like, thousands of developers creating these models to certain standards established by the various communities and others in interrelation with each other. So you will have standard benchmark models like Red Hat version seven or something like that, or Ubuntu 20. Like, there will be regular releases of these models. It will be independent. The countries and others will provide the compute for it. We'll be one of the voices at the table doing our little bit. And then people will build on those benchmark models and fine tune them for themselves. So, on the Apple architecture, like I said, there is a neural engine that's not really used. Others have these same foundation model engines coming through. So I think in five to 10 years, you will have AI at the edge, AI in the cloud, and the hybrid interaction of those two will be super powerful across modalities. This is also one reason why we are fully multimodal. If people are like, "Why don't you just focus on image?" Because you don't know where the learnings will come from or the value across all of these. So, it makes sense for us to be that layer-one infrastructure there to get things going and then have a business model on scaling this. Lukas: Yeah, that makes sense. I want to make sure I ask you about education. I mean, that comes up every time we talk, it comes up in every interview. It's obviously something that you're super passionate about. How does education fit into Stability? Emad: A large part of Stability, for my own personal focus, is around the rights of children, because a lot of ethics is complex and things like that, but we all agree that children don't have agency, and so they have rights. I'm not talking about effective altruism a million years from now. I'm talking about kids right now, today. And I was like, "If I go to the future and bring back technology to make kids' lives better, what would I do?" I'd allow them to create any image and use these tools, allow them to do code. You know, the type of stuff that Amjad at Replit does. I would allow them to communicate and be educated and have healthcare. So with the education thing, it was like first proving that an app on a tablet could actually make a difference, which we've done now through the RCTs. Now, it's about bringing the world together and saying, "What's the best darn experience we can have to teach these kids?" Because it doesn't make sense that we teach arithmetic in a different way across every single country. And we don't know what the best way to teach linear algebra is. But then an AI model that teaches the kids and learns from the kids at scale — because you do entire countries at once — generates the best data in the world for creating national-level models. So, if you want to create a Malawi model, you need to capture the Malawian culture and all the contexts.
And if it's trained by little Malawian kids, that's a national-level resource. So this is what I discussed in [??]. Like, we're not feeding AI models the right things. We're feeding them a mishmash of stuff, but if we actually intentionally create data that teaches them to learn, that's going to make the best models out there. And similarly, like I said, the discussion that we've had about AI models going to the edge, having control over the hardware, software, and deployment means that we can standardize these tablets to be little AI machines, which will be amazing, because they'll have a richer experience than anyone else. And I personally think — I don't know if you've got kids, Lukas — that 13 months at one hour a day to learn literacy and numeracy is good for any kid anywhere in the world. In a refugee camp, people earn a few bucks a day at best — I think Malawi's like $5 to $10 a month. It's crazy, especially when you've got one teacher per 400 kids. How else are you going to educate them other than with this technology? How else are you going to do it other than creating an open source standard that's scalable and working with the World Bank and others to scale it? I think this technology has a huge role to play in education. I think that incorporating it into the West will be incredibly difficult and an uphill battle. Taking it where the ROI is largest, in emerging markets and places like that, is going to be the best. And then we'll create a system that's better for everyone. Because again, we have to decide — not from a business perspective, but from a societal perspective — what should be open and a public good, and what should be closed. Should the tools to allow anyone to be creative, anyone to be educated, and other things like that be run by private companies? Probably not. They should be a public good. Should they be run by the United Nations and other bureaucratic hellholes? Probably not. So, with this technology coming right now, there's a little window where we can create better, more adaptive systems and bring them to the people where they can have the most value, and that's what Stability is focused on. Because I think they could become real infrastructure for the next generation. Lukas: Just to be concrete about this, you're imagining making a tablet that has an AI teacher that's literally talking to students and teaching them things like linear algebra? Emad: Yep, I want to call it "One AI Per Child," but others are against that. But that's the concept. You have an AI that helps you. Because what is AI but information classification? So what's the information that can help that kid, be it in Malawi or Brooklyn, get to the next part of their journey? And then having a standardized architecture for that so you can take what works in Malawi and apply it to Ethiopia, apply it to Benin, apply it anywhere. It makes sense. And the output data of that is customized datasets that are ideal for local language models, and local image models, and local video models if you execute correctly. So, this is why I think we are not OpenAI or DeepMind. We don't train giant models. The entire focus is AI that's accessible for everyone. It's emerging markets and creativity. These are our two focuses. Again, like, I don't really care about AGI, except for it not killing us. I don't want to create a generalized intelligence. I want to create specific intelligences that are widely available so we close the digital divide and make people's lives better.
That's the key focus and lodestar of what we do. Lukas: That totally resonates with me, but don't you feel like the trends lately have been creating better specific intelligences through creating better general intelligences? Like, watching the last 20 years of machine learning, it seems like more and more general-purpose things that are then fine-tuned on specific applications. Do you expect that trend to change? Emad: I think it's an arc, right? So it was "scaling is all you need and more layers," and now it's better datasets, right? And so as you have this adaptation, I think the intelligence goes to the edge. I think instruct models and the combination of reinforcement learning and deep learning are the next big trend that we're seeing start to accelerate. And again, that's why we've got CarperAI as our representative contrastive learning lab. I think it'll be loads of models. Because these big models were there, but they weren't really used, right? Now they're being used. So Stable Diffusion is being used probably by, what — it's being used by millions of people each day. As it gets better and as people release more models, this technology will be used by more and more people, be it private or public. And so I think that then it becomes about inference and cost, because if you've got a model that's open source and is 80% as good as a closed model — and open source models will always trail closed source models, because you can always take an open model, make it closed, and train it on better data — then that creates a different paradigm. And again, I think it was this breakthrough point whereby "stack more layers" became less effective as you went up. Now it's a case of "make the layers more effective," as it were, and figuring out how we optimize these models once we can start doing A/B tests and training 10 of them at once. Where are the key optimization points here? I think that the optimization points will be a model that's used by a million people, versus a model that's used by an internal team. A million people will always win, because people will figure out all sorts of tricks. Like DreamBooth training — that's where you take a few pictures of yourself and fine-tune the image model. When it first came out, that required 48 gigabytes of VRAM. After three weeks, the community building on it had it down to eight gigabytes. And having that, and having hundreds and hundreds of developers hacking away at these things and figuring out how to put them into processes, as opposed to zero-shot — these won't be the best at zero-shot, but they will be more useful because they're in pipelines. I think that we've shown that with Stable Diffusion versus other image things that stay within their own systems. But now we just have to upgrade the models again. Lukas: I have one more question that I didn't actually prepare, and I'm curious if you have thoughts on it: you've talked about your autistic son a few times, and I actually have a little sister who's autistic. Autism has come up in many of these interviews that I've done, often through autistic family members. I'm curious, do you see any connection between autism and machine learning? Emad: 100%, and this is why I really love transformer-based architectures. Because, what I did with my son in terms of repurposing drugs for him — and we'll do a full, formal thing about this in the next year or two where we'll share all the learnings — is about reducing the noise and getting him to pay attention by reducing the imbalance.
So there's too much glutamate making him excited and not enough GABA calming him down. And then having things like applied behavioral analysis, where he does rapid iterations to learn that a "cup" means different things in different contexts, with a variable reward schedule where he gets rewarded at random so he's more motivated to rebuild these things. It's similar for a stroke victim and other things, but, again, you look at what these machine models do with transformer-based architectures. Attention is all you need. They pay attention to the important parts, and that interconnection of creating latent spaces, or hidden layers of meaning, is almost exactly the same. Well, it's not the same, but it's the same principle as what we do for rebuilding the language capabilities of our kids. And so this is one of the things that really drew me to it, and I was like, "I kind of get that." Like, I have Asperger's myself, so I had to rebuild and refigure out a lot of stuff. Principle-based approaches. That's why I was like, it's almost like type-one versus type-two thinking. Retrieval versus instinct. A combination of those is the most powerful combination we've ever had as humanity. And again, I think that it'll really be able to help with this. The other aspect of it is personalized medicine and education and other stuff. We don't have enough teachers. We don't have enough doctors. These technologies are reaching human level in very narrow fields. What if we could put this on tablets out there? "One AI Per Child" doesn't just mean, like, something… it's literally an AI that can help them with everything — whether they've got special needs or they're neurotypical or anything like that — and personalize things for them, because our education system treats everyone like a number. It's like ergodicity versus the non-ergodicity of humans. Like, tossing a thousand coins at once is the same as tossing one coin a thousand times, but people aren't coins. The reality is we are all unique, but we didn't have the tools to personalize until now. This is the first technology that could do that. So in doing that, we can figure out systemic diseases and conditions like autism, like COVID and others. This is why I focused on COVID. This was a multi-systemic disease that modern science wouldn't be able to deal with. Like, why do you have massive ferritin levels and other things in the blood? Is it serotonin syndrome? Is it this or that? The first principles analysis of COVID is even still lacking today. Thankfully, we found treatments and, again, models are one part of the science, but information isn't getting to where it's needed on a personalized basis. And, again, we can build systems for that. But AI models are only one part of that. It's more classical open source AI for the rest of it. So, yeah, I think there are parallels to this. And of course, being in our industry, it is very, very prevalent, right? It's like a double-edged sword. Lukas: Well, I'm curious, do you think your Asperger's has given you some advantages in building this really unique company? Emad: Yeah, no, 100%. Like, my real skill is mechanism design. I know how to convince governments and multilaterals and others. Like, Stability has huge international support because I've positioned it just right at the right time. And my Asperger's and ADHD typically balance each other out, I like to say. So you've got to focus on what you're good at, and that's what I'm good at.
That's my job here: to absorb the hate and to also do the big things while letting the real heroes, who are the developers and the community, get on with things. Also, it allows me to have a different perspective in that most companies would try to control this. But, really, we are just trying to capitalize it and get it out there because, I think, again — from a mechanism design perspective and morally — that is the right thing to do. Lukas: Interesting. Well, we always end with two questions, and I want to make sure I get them in. The second to last is pretty open-ended, but usually we ask what's a topic in machine learning that you think is underrated. You've mentioned a whole bunch, but is there anything else that you think is deserving of more study than it gets right now? Emad: Within machine learning? I think it's really data, to be honest. It's like, you can say classical AI was largely data science, but the role of data in these models is vastly… just not looked at at all. I think that you can use 10 or 100 times less data and get better outcomes from these models once we really look at it and at how the data impacts the latent space and some of this other stuff. So, like I said, we're building a team for that. And other people have been doing data cleaning, but I don't think that's enough. I think we'll see some remarkable things advance in that aspect. Lukas: It's so funny because my last company, which I ran for 10 years, did data collection, and we always found, actually, data cleaning was the most important thing that anyone could do to make their models better, but we could never convince people to do as much data cleaning as we thought they should. Emad: Everyone's like, it's cooler to stack more layers, right? Lukas: Yeah. Emad: It's data cleaning, data structure. There's a whole bunch in there. Lukas: The last question that we always ask is what's a hard part about taking a model and actually turning it into a product? You've obviously just recently created some products built on top of these big models. I'm curious, outside of the training of the model, what have been maybe some unexpected challenges in making the whole product work cohesively? Emad: We have DreamStudio Lite and DreamStudio Pro coming up very soon. I think, probably, the key challenge is just getting it responsive enough to have, really, that user experience that is seamless. We've gone to sub-one-second inference now, but that was very difficult to do. We had to do a lot of optimization there because, again — even if this one is relatively small, it's still a large model, right? The second part, I think, is around some of the fine tuning and creating custom models. That's a pretty different take on things. I think there's a lot of work that's been going on into where we actually store and keep the models, and the user data aspects of that become a very curious thing. I think the most important thing is just having snappy consumer feedback loops for these large models, and maintaining that, especially because we're doing animation, which people don't want to wait around for. They either wait a long time, or they don't want to wait at all. Like, "Why isn't it real time?" Even though normally this would take, like, three weeks, you know? Lukas: That does sound challenging! Well, thank you so much. I really appreciate it. That was a fun interview. Emad: No problem, Lukas.
Cheers, buddy.",13638 +Jehan Wickramasuriya — AI in High-Stress Scenarios,https://www.youtube.com/watch?v=unPEuc-HV4s,3602,2022-10-06,"Jehan: This is what I mean when I say “the complexity for a machine learning team is actually exponentially increasing.” You have to look at these other machine-driven ways to increase the quality of your data and augment your data sets. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Jehan is VP of AI, Platform and Data Services at Motorola Solutions, where he's responsible for a vast number of machine learning models in lots of different applications running live in production. This is a very practical, useful conversation, and I hope you enjoy it. Lukas: I was thinking that probably the place to start here is actually what Motorola does. I feel like Motorola has this brand for people my age making phones. Jehan: I was just telling Kelly that I—actually even stranger—when I finished my PhD, I started actually working at Motorola before, right around the time the iPhone came out, and actually worked on mobile devices. Then after a stint at other companies came back. I think like telling people what… definitely that brand is stuck in people’s heads. When they think of Motorola, they think of those things, which is not what this company does at all! Lukas: Right, right. So why don’t we start there? What are, like, the key things that your company does? Jehan: Yeah. So Motorola Solutions essentially is completely focused on the safety and security of the community and enterprises. So essentially there are a couple of different segments of the business. One focuses on enterprise security and video security, physical access control. And then the other focuses on public safety for first responders. Essentially everything from when a 911 call comes in to dispatch units, and then resolution of the incident and case closure, that’s Motorola Solutions’ focus. I think in the public safety space, we’re most well known for our mission-critical communications infrastructure, which has been something that first responders have relied on for decades now, which is when times get rough, when you see firefighters charging into burning buildings, the radio is what they really focus on, especially back when communication coverage through broadband was much more sparse than it is now. And it’s still a huge challenge in many parts of the world, not just the United States, but even overseas in the UK. So in general, that is Motorola Solutions’ mission: essentially to provide safety and security for those two segments. Essentially it’s the same audience. It’s making the community safer, but in terms of how the product portfolio is situated, it’s basically those two segments of the business. Lukas: Interesting. So how does AI fit into that? I can think of lots of different ways, but like practically what goes on? It seems like a really high-stakes place to introduce artificial intelligence. Jehan: It definitely is. And if I think about my journey coming to the company, I’ve really worked in consumer most of my career. So machine learning, we kind of took it for granted that it’s just a tool that you use you know… the applications and services you’re building, you use it to essentially accelerate, automate and help decision making. Here you do the same things except like you said, the opportunity cost for some decisions that may be incorrect, and also bridging that human understanding, is quite high. 
So I mean, I would say the mission is still the same for all of us who work in machine learning, where we want to kind of maximize human potential and use it as an assistive tool. I think the reason that this is so important here is that many of our users are in very high-stress situations. So when your cognitive bandwidth is limited, your ability to make decisions as a human is definitely hampered. Now one thing that flows complementary to that is that the amount of data is exploding. The amount of data that these users have to consider day in day out is exploding, whether it’s a 911 call taker or a security guard: more video, more audio, more unstructured text, more structured data, more communication. So then the question becomes, “How can I use AI to be able to simplify that?” And I think it’s not just an AI problem, it’s also a usability problem. And actually, it’s funny. This weekend, I was reading a book by Katie Swindler, which is “Life and Death Design.” And increasingly there’s a lot of these kind of usability considerations for designing for people in high stress situations. And I think once you get past the frozen response where your prefrontal cortex kicks in and then you’re like, “Okay, now what do I need to do?”— I think one of the things that really stuck out for me was designing for experts. Because in public safety, and even in video security, you have a lot of expert users—whether they’re someone who’s been watching video for years and years, they know exactly what’s happening on every single camera, they know the playbook that when something goes awry what to do—expert users typically, when you speed things up for them, they tend to do better because they automate a lot of the standard stuff. The stuff that kind of has to happen ahead of actually using their brain to actually figure out a problem, they automate a lot of it. Whereas if you take a novice user—and I’ll get to why this is important in a second—for novice users, they do want to think it through before they get to anything. I think for expert users, that’s becoming a luxury that many of these roles don’t have anymore. Staffing is challenging. I don’t know how much the audience knows about, I would say, public safety. But a lot of those roles… like when you call 911, your life is essentially in the hands of someone who’s taking that call, figuring out what’s going on and bridging that help to you when you need it. Those expert users are now churning much, much more, in which case training becomes a huge problem. Expert users tend to do better because they’ve kind of simplified the workflow. I think this is where AI can really help in that process. I think for novice users, AI can bridge some of that gap. They don’t have years of expertise to fall back on, where AI can help bridge some of that so that they can actually focus their attention more effectively. Lukas: Can we get a little more specific about a single use case and what your software is doing in that? Jehan: Yeah, let’s take video security, for example. So, traditionally, when you think about video security, you think of someone who’s watching video. As a company, one of the North Star goals that we have is that no one should watch video in the limit because it’s actually impossible. It’s so ridiculous when you actually visit one of these, whether it’s the security operations center or real-time crime center, you’ll see how ridiculous it is to have someone watching all of that video. The second is managing disparate systems. 
Whether it’s enterprise or public safety, you have a lot of different vendors in the space. The space is extremely fragmented. I think about it a little bit like healthcare sometimes, where you have information— it’s just present in different systems. So the question really becomes, how can you centrally manage that? And we’ll talk about cloud and AI a little bit, maybe, as we go on. But it’s really about, “How do you optimize that response?” So we use analytics, we use AI to be able to help the operator not only focus on what matters within an individual video stream, but also across those different video streams and different systems, be able to surface relevant information. And relevance is really, I think, the key part of what we’re focusing on now. Lukas: And can we get even a little more specific? Where are we? What’s a customer? What are they trying to do? I mean, I know nothing about video security, so I think you’re really going to need to walk me through it. Jehan: Okay, so let’s set the stage. So for enterprise, some of our biggest customers are, for example, schools. School is a very unique operating environment, I would say. Especially in the United States, with a lot of the issues that have happened here and continue. So typically you have two classes of users in video security. You have those who monitor video, so a SOC where they essentially pay people to watch video and essentially deal with alerts from the system. At that point, you have systems that may be not AI-enhanced at all—no analytics at all—where you’re just watching video and you have to watch the video and essentially deal with it as a human. Increasingly, many of those video security systems have AI. So you’re actually watching events and you’re viewing video. The second class of user is not watching video at all. In fact, it is very rare to have someone spend the whole day at their desk in many of these cases, especially at school. You might have a single roaming security guard who is essentially going about their job checking on different things in the school, dealing with student related issues, tending to the staff. The only thing they have is a mobile phone in their pocket where you may be getting alerts from your underlying video security system. So essentially you have to figure out how to deal with those alerts, including the accuracy of those alerts and triaging that to the right response. But that’s basically the two customer bases that we have. So the problem we want to be able to solve is, “How do we get the most relevant alerts to those customers and build a user experience where they can effectively deal with the situation when they’re under stress.” Lukas: And so what does an alert look like? Jehan: So an alert may be just an event that comes from a video security system. For example, many of the cameras that we build today have AI embedded in them. That AI essentially allows you to set up different rules. The customer sets up different rules. For example, they may set up a line crossing rule that says, “Okay, when someone crosses this line, send me an event.” That event will basically have, “This rule was triggered. 
Here’s a snapshot of what happened,” so a person crossing the line, “and some other metadata depending on how the rule is configured.” So essentially, an alert will be… it’s very similar to an Event-Condition-Action-type workflow, where the action is performed by the human, but the event is usually taken care of by AI, typically, whether it’s using some combination of object detection and tracking classification. And then the condition is usually set up by humans—and we should talk about that because as a company, and from an AI perspective, we don’t believe that rules are the right way to go, even though much of what we know as AI came from rule-based systems. Configuring a system using rules makes it very difficult for humans to be able to take what they have in their head visually and then map that to something that they need to look out for in the future. Because proactively, you set up all of this configuration in the rule—which depends on analytics metadata and AI metadata—but most of us typically don’t know what’s going to happen in the future. This has happened at setup and might never be changed for a long time because it may be complex to go in and change it. But me, as a human, when I see something, I know that it’s not right. That’s one thing as humans we do a very, very good job of is that when we visually see something, we can reason about the fact that there’s something gone awry there. There’s something that we want to know about in the future. The way we’re building systems today is to be able to get closer to how humans think and allow humans to essentially visually specify the things that they care about so that we can essentially push this workflow from being largely a very reactive workflow—and I say that across public safety and enterprise—to a more proactive workflow. And this is where AI can really help. Lukas: And so how do you frame the problem as a vision problem? Like, are you trying to track all the people and objects and then set up a rule that’s like, “If a person goes in this area where there’s not supposed to be a person, we fire an event?” Or is it more unstructured, like if we have training data that’s like, “People in the area there’s not supposed to be people,” and then we’re just sort of looking for a custom model to flag something? Jehan: That’s a really good question. So I would say the majority of use cases and most vendors in this space… and many vendors have transitioned into deep learning-based models which really opened up from a vision standpoint what we can do, obviously. But typically you have object detection as kind of the core to everything. Which is, people and vehicles are the biggest thing that you care about in most of these. Doesn’t matter, vertical-specific or otherwise, you’ve got to have that. If you’re running analytics on a camera, you’re also doing tracking, obviously. Because tracking can provide you with a lot of other pieces of metadata, not only to make your object detection more efficient, but also you can create different rules around speed, direction, things like that. So how it’s done today is you have a set of analytics that run, whether it’s at the edge, on a server or in the cloud… and we should talk about distributed computation because I think this is a key part of where we’re going from an AI perspective that I think is a little bit different from where we are today. 
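To make the Event-Condition-Action framing Jehan describes above concrete, here is an illustrative sketch of what a "someone crossed this line" rule reduces to once a tracker is producing object centroids. It is not Motorola's implementation; the Track structure, class names, and alert payload are hypothetical stand-ins.

```python
# Illustrative line-crossing rule over tracker output, in the
# Event-Condition-Action style described above.
from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

@dataclass
class Track:
    track_id: int
    cls: str      # e.g. "person" or "vehicle" from the object detector
    prev: Point   # centroid in the previous frame
    curr: Point   # centroid in the current frame

def _side(p: Point, a: Point, b: Point) -> float:
    """Signed area test: which side of the line a->b the point p falls on."""
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x)

def crossed_line(prev: Point, curr: Point, a: Point, b: Point) -> bool:
    """Event: the centroid moved from one side of the line to the other.
    For brevity the line is treated as infinite rather than a bounded segment."""
    return _side(prev, a, b) * _side(curr, a, b) < 0

def evaluate_rule(track: Track, line_a: Point, line_b: Point, allowed={"person"}):
    """Condition + Action: only fire for configured classes, then emit an alert."""
    if track.cls in allowed and crossed_line(track.prev, track.curr, line_a, line_b):
        return {"event": "line_crossing", "track_id": track.track_id}
    return None

# Example: a person moving downward across a horizontal line at y = 5.
alert = evaluate_rule(
    Track(1, "person", Point(3.0, 4.0), Point(3.2, 6.0)),
    line_a=Point(0.0, 5.0), line_b=Point(10.0, 5.0),
)
```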
You typically have those analytics generating primitives, metadata around, “I detected a person, okay, I subclassified this person based on attributes I understand,” same thing for vehicles. Now I take that metadata and I set up different rules as you said. So if I want to know, for example… if a blue car is what I care about, I can use that metadata to my advantage to set up a rule. That rule fires, you get an alert and then that alert goes back to what we talked about at the start where a human can take some action on it. Lukas: And does it get as detailed as, “These specific people can go here and if they’re not sort of on this list of people, don’t let them go in this area?” How advanced does this get? Jehan: That’s—again—I think a very important point. Detection and even matching happens at a level on a computer vision domain where we don’t need identity, for example. What you’re talking about is now connecting that metadata with identity. So, “I do not want a particular person to enter because this person is on a watch list. It might be someone who’s dangerous,” and so on and so forth. So that’s when you get into things like OCR and facial recognition, for example, where now you’re connecting identity with those descriptive AI analytics, where I don’t know who it is, but I know how to find the person in the visual domain, for example. That is something customers can do on their sites, and that information is managed completely by them. But in terms of getting the analytics down from an object to an actual individual, you need that second piece of information to be able to connect identity. Lukas: Interesting. Do your models typically run on the edge or the cloud or is there some kind of hybrid situation? How do you handle that? Jehan: That’s a great question. For our analytics, we use all three, and our vision is that AI really needs to be democratized for users regardless of the equipment that they’re using. Some people may invest a lot in edge hardware, where it’s typically quite expensive, but you can run a lot of the AI computation efficiently at the edge. We use a variety of AI SOCs, depending on the platform, but we also leverage distributed computation. Because one of the usability factors that we think is important is the ability to centrally manage information that comes from your AI models. For users, that’s a game changer. And so we distribute computation. It should be transparent to the user. You might have a cheap IP camera, for example, but you still want to get the benefits of AI. So at that point, you may be doing the bulk of your computation in the cloud or on a server on premises. How you make that cost effective, there are some interesting things that we do to do that. I think the biggest benefit—how we think about the edge when it comes to AI in vision—is the camera really tells you where to look. Once you can focus attention, then you can actually be much more opinionated and sure about how much computation you spend to analyze that attention. But in the limit, we typically can deal with very simple cameras that essentially only have motion-based alerts, which can be very noisy because they can be triggered constantly. And then our cloud AI essentially is able to analyze that and figure out if that’s actually a true event or not. Lukas: Interesting. So the primary reason for going to edge is just that it’s faster and uses less bandwidth. Is that right? I sort of thought there might be data privacy issues that would cause a lot of customers to go local. 
Jehan: Absolutely. I think customers have lots of different reasons. So outside of the technical challenges around bandwidth and compute, absolutely. Some customers prefer to manage their data entirely on-premises, and they have that option. That’s essentially the way we build our system. They have the option of doing that. I think increasingly customers are seeing the benefits of centrally managing, even if their data is on-premises. Which is where federated systems become very, very important, right? So, “How do I bring the benefits of centrally managed AI while still operating on AI metadata that is generated on-premises?” We do have solutions that also do that. For example, I may be able to conduct a natural language search from the cloud, but that cloud search gets executed on-premises. So if I’m doing a similarity search, for example, where I’m essentially searching in the embedding space for an answer, I may not be storing any of that data in the cloud. It may be on-premises. Flexibility is key, I think, both in terms of privacy and in terms of managing compute and bandwidth. Lukas: How do you think about the evaluation of your models? Let’s just take the object recognition to be really concrete. I would think that every customer would have different levels of quality in their object recognition depending on what cameras they’re using, what the background looks like even. And then I would imagine that both kinds of errors are bad for you, right? You don’t want false positives, obviously. And then you also don’t want false negatives because it’s sort of operator fatigue. But then also, I’d imagine that you might be violating contracts if you sort of miss a real event that you want to trigger. So how do you think about that? Jehan: That’s an excellent question. And I think it also comes back to one of the reasons why we work with Weights & Biases is essentially exactly that. The evaluation of these things have gotten a lot more complicated and nuanced, I would say, even over the last few years where, initially, a lot of vendors—including us—trained one model, essentially, that is deployed on the camera, and you can essentially buy a camera and use it anywhere. What we’ve learned over the last few years is that these different form factors, different fields of view, different environmental conditions mean that it really doesn’t matter how much data you have. There’s always going to be an element to that that needs to be essentially adapted to the customer. So, in terms of evaluation, how we look at it is we kind of break the machine learning components down to its primitive pieces. For an object detector, we may not change that as often because that is something that uses the most diverse data that we have, is as generalizable as possible. The one thing you might do in that case is you might train different variants, different resolutions, you might have different models that deal with thermal, for example. We have thermal cameras that are trained on specific data that is specific to that. So, you have models that don’t change that often but essentially provide you that core signal of where to look. And then as you go downstream, you might start getting much more specialized where a user might have a very specific notion of what attributes they want to subclassify, and it’s not the same across different users. This is where the cloud helps and also where we can deploy additional types of models. 
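The similarity search Jehan describes above, querying in the embedding space with the search executed wherever the embeddings actually live, can be sketched with plain NumPy. The encoders that would produce the query (an image crop, or text such as "a man with a green shirt") are assumed to exist and are not shown; only the cosine-similarity lookup is illustrated.

```python
import numpy as np

# gallery: embeddings extracted on-premises from camera frames (N x D).
# query:  an embedding of the object of interest produced by some encoder.
def top_k_similar(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    query = query / np.linalg.norm(query)
    gallery = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = gallery @ query                  # cosine similarity against every stored embedding
    idx = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in idx]

# Toy example: 1,000 stored 512-d embeddings, one query close to entry 123.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 512))
query = gallery[123] + 0.05 * rng.normal(size=512)
print(top_k_similar(query, gallery, k=3))     # index 123 should rank first
```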
In terms of evaluation, I would say that those types of models are evaluated on a much more fine-grained, case-by-case basis. So customer by customer, region by region—we kind of cluster different customer types to be able to understand how the models are doing. And I think part of it is also our customers have gotten a lot more used to video analytics-driven systems, where they will come to us and say, “Hey, this isn’t performing well in this situation.” At that point, we’ll go and take a look with the customer. We would collect data to try to understand what’s going on. And then our machine learning teams would dig in to try to make adjustments essentially based on that customer. So being able to actually do that and scale a machine learning team has been one of the big challenges, I would say, in the last couple of years, where we’ve really focused on the best tools. It started with data and data operations. We worked with Figure Eight when they transitioned up—and I know that’s kind of like your background—and when I came in and started running the team over here, that was my number one concern: data, data annotation and data efficiency. I think now we’ve got a good handle on working with different annotation vendors and we realize that you kind of need a multiplicity of annotators and different coverage to be able to do these different tasks. Now what’s happened is evaluation has become a bigger problem, which is, “How do I connect my inference infrastructure with my machine learning evaluation tools?” How do I visualize that information so the different stakeholders, whether it’s a firmware engineer or a machine learning engineer, can see, “How did my new model do compared to my previous model?” And, “This specific customer problem that we fixed: how did it affect all the other customers?” And so managing my annotated image data sets, custom visualizations, adding loggers to our visualization system so it can actually deal with our machine learning training repos and our model zoos: this has become probably the biggest area of concentration, I would say, in the last 12 to 18 months. Which is why we use tools like Weights & Biases across our team, because without something like that it’s impossible to keep up. We can do the evaluation and the bookkeeping for those runs, measure across those different data sets, and actually increase our speed in doing that. I would say that’s probably the biggest barrier to entry right now in terms of getting new models out. Lukas: So does every customer kind of get their own evaluation before it goes live? Jehan: Typically no, because we have so many customers. We have thousands and thousands of customers, so that becomes very difficult. I would say the thing that we’ve learned to do is… so a lot of customers, if they have an issue with a particular model or the analytics, it will come through support and get triaged into our machine learning team, for example. What we try to do is figure out if what these customers are seeing is common. What other customers may be having this issue? How do we cluster and segment that so that we can go after the problem? Because, as you know, machine learning time is precious and so we try not to solve problems that end up truly being a one-off. There may be other ways that a customer may be able to deal with that problem. Also, you have to separate model performance from installation and other things like that, where a customer… like, their camera might have moved.
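Since Jehan mentions using Weights & Biases for the evaluation bookkeeping, here is a minimal, hypothetical sketch of logging per-customer-cluster metrics for a new model run so it can be compared against earlier runs. The project name, cluster labels, and metric values are made up for illustration.

```python
import wandb

# Hypothetical per-cluster evaluation results for a new detector build.
results = {
    "retail_indoor":   {"precision": 0.93, "recall": 0.88},
    "perimeter_night": {"precision": 0.85, "recall": 0.79},
    "thermal":         {"precision": 0.90, "recall": 0.84},
}

run = wandb.init(project="detector-eval", name="detector-v2", job_type="evaluation")

table = wandb.Table(columns=["customer_cluster", "precision", "recall"])
for cluster, m in results.items():
    table.add_data(cluster, m["precision"], m["recall"])
    run.log({f"{cluster}/precision": m["precision"], f"{cluster}/recall": m["recall"]})

run.log({"per_cluster_metrics": table})   # compare this table across runs in the UI
run.finish()
```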
They might have not positioned it in the right way for that task. And actually, this is some of the stuff that we’re doing now in the cloud to be able to do things like camera help: use machine learning to determine if the camera has moved, if there’s a spider web on it. I mean, things as simple as that because initially that used to just hit the machine learning team. And they were like, “Why are you seeing bad performance?” Okay, after a bunch of back and forth with a customer, you find out this is the problem. Okay, we want to automate that. We want to use machine learning to help us find issues like that. And so that’s not as exciting as maybe developing the next object detector, focusing on a new backbone that gives us much better performance. It’s things like this that really help us save time in the machine learning team so that we can do more interesting things. Lukas: This is probably not a well-formulated question, but I’m just curious, how many models are you working on at any given time? How many models do you have live in customer sites? Jehan: So, I run the AI team at Motorola. It’s not just computer vision, it’s speech and audio, language and NLP. So across all of those, that’s probably a very large number of models. If we focus just on the video space, I would say we’re still looking at under 100 models. So like tens of models. And very, very specifically, we’ve tried to keep a handle on it, because I think there is definitely a need for more custom solutions. But managing those solutions—as you pointed out before—we want to make sure that we have the ability to monitor those effectively and evaluate those effectively. It’s still a relatively small number of models, but they touch many, many, many, many customers. And so monitoring and evaluation becomes a huge problem, as well as going back to annotation. I mean, we’re looking at other things like weak labeling approaches, confident learning. Alex Ratner over at Stanford. Snorkel AI, the company that they spun out. We used Snorkel actually a few years ago when it was an open source project. And it was difficult mostly because of all the engineering and plumbing needed to actually make that happen. And now that’s what I think Alex’s startup is doing now with Snorkel Flow. And, you know, I talked to him recently. I think it’s solutions like that that we really need to get into the edge cases for AI. I think you don’t have data from all these customers. And a lot of customers don’t feel comfortable sharing data, which is completely, I think, fine. We have to find other ways to solve the problem. Another example is a company called Cleanlab, which is, “How do you learn with noisy data?” At that point, you’ve accumulated a massive amount of data from different places. Label quality may be highly questionable. So then the question becomes, “How do I actually reason across that in a systematic way?” I know you’re smiling because these are the exact things that I think Weights & Biases helps a lot of its customers deal with. But this is what I mean when I say “the complexity for a machine learning team is actually exponentially increasing.” You have to look at these other machine-driven ways to increase the quality of your data and augment your data sets. So now you have to evaluate those. You have to evaluate models that actually help your data, which also needs to be evaluated. And so I think just getting a handle on that is one of the challenges that we have. 
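As a rough illustration of the confident-learning idea behind tools like Cleanlab that Jehan brings up, here is a simplified sketch that flags likely label issues: examples whose annotated class receives low predicted probability relative to a per-class confidence threshold. Real libraries do considerably more; this only shows the core intuition, and the 0.5 margin is an arbitrary choice.

```python
import numpy as np

def flag_label_issues(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """labels: (N,) integer annotations; pred_probs: (N, C) out-of-sample predicted probabilities.
    Returns indices of examples whose given label looks inconsistent with the model."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: average model confidence on examples annotated with that class.
    thresholds = np.array([
        pred_probs[labels == c, c].mean() if np.any(labels == c) else 1.0
        for c in range(n_classes)
    ])
    given_label_prob = pred_probs[np.arange(len(labels)), labels]
    suspect = given_label_prob < thresholds[labels] * 0.5   # crude margin; tune in practice
    return np.where(suspect)[0]

# Toy example: the third example is labeled class 0 but the model is confident it is class 1.
labels = np.array([0, 1, 0, 1])
pred_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.05, 0.95], [0.3, 0.7]])
print(flag_label_issues(labels, pred_probs))   # -> [2]
```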
Lukas: So this is a really practical question, but I feel like a lot of people ask me what best practice is here. When a customer complains, someone is like, “Hey, you know, this thing did the wrong thing in this situation.” How do you actually triage that? And what are the likely things that you end up doing to fix it? Jehan: Yeah, really good question. I think especially when it’s distributed computation, where they’ve got a camera that’s running an AI model or a bunch of AI models, they’ve got a server on premises that is also running AI and cloud. So the customer doesn’t care and they shouldn’t care where any of that is running. They’ll just say, “A particular event is false positive… the false positive rate has increased dramatically. I’m seeing this problem, go solve it.” And so typically that hits our support team. And I think we are continually trying to make sure that we’re giving our support teams the better tools to be able to triage it. And I know this is a problem that you guys are very familiar with as well, where otherwise it’s a sieve. It passes straight through down to the AI team and you have now you have machine learning engineers and data scientists getting involved way too early. Part of the problem… Lukas: And before you go on, I’m just curious, are there any tools that you’ve used that you’d recommend? Like, are you using any kind of ML explainability tools or is it kind of home built? And if it’s home built, what kinds of things are you showing to the support team? Jehan: It’s a mix. The first thing that we built a lot of in house is a lot of visual tools. So, you know the system. If you can get video from the system—you have video clips or images—how can I feed that in and dump diagnostic data immediately, and then distill that diagnostic data so that the support team can at least try to figure out where the problem is? Is it happening in object detection? Is it happening because of a classification problem? Is it happening because of the environment, the environment has changed where your performance dropped off a cliff? So we built some homegrown tools specifically for cameras because the camera is probably one of the most difficult things to debug, because you have essentially AI running on a firmware build. We do have to do manual field testing of those cameras as well. You can’t just test it upstream when we generate a model. So a lot of the homegrown tools are particularly to deal with our cameras where we can dump the data and understand it. Explainability is a very interesting point. We’re trying to do more of that, where we’re trying to work with a few more tools that exist out there where not only can we get some of that meaningful information out, we can map it to what they understand. Because as you go up the stack—different levels of sophistication in terms of what we run—I think the really important part is feeding that information back into our evaluation, where you started. If we have a problem with a customer and the support team is able to identify it, maybe they pass it to our QA team. So now the QA team has more sophisticated tools. They maybe… actually, they do use Weights & Biases today. So for example, they can go and check the machine learning team’s last, whatever, X releases and all of the results are there. They can go and run an evaluation by themselves. We made it as turnkey as possible. So the level at which the AI team operates is different to where the QA team operates, where it’s dead simple. 
We put some kind of abstracted UI on top of it where they can essentially run the same type of evaluation over the new data that has the problem, be able to understand where the problem is happening and then involve the AI team, where they can jump in and actually do this. I can’t overstate how big a difference this has made, because initially all of those requests were coming straight into the AI team, where we were getting overwhelmed with requests. And a lot of it is triage. I would say 70% of the time on average is triaging and identifying the problem. Fixing the problem is typically not too bad with the exception of problems where you have gaps in your data or something more fundamental that you need to fix. Lukas: And what are the fundamental problems that you might have? The ones that are really tough to fix. Jehan: I think typically that happens when our data sets essentially don’t have coverage, where you essentially hit a particular environment or a field of view where you just don’t have the training data in the model to be able to actually adequately deal with it. Or actually, you might have a new model. For example, some of the new models we’re working on specifically focus on identifying very small objects at distance. That is a very difficult problem because it’s difficult for a human and it’s difficult for a CNN. When you try to disambiguate something at 300 meters, it’s basically a patch. I mean, at that point, you’re just doing motion detection. So you have to think outside the box a little bit in terms of figuring out what that is. But typically… that’s one example, where many of our customers still use AI for perimeter protection. So object detection at range is something that is a constant query, I would say, especially after we moved to deep learning-based analytics. In some cases, customers think that the previous generation of cascade-based models worked better because they don’t actually have to do detection. It’s essentially blob identification and motion detection. So when they lost some of that capability, they’re like, “Well, why isn’t the CNN, why isn’t the object detector actually picking this up?” And we kind of have to explain it to them. One of the things that we’re very proud of today is we’ve been able to combine some of those techniques together, where typically you’ll get a detection that ends up being very low confidence, where it typically wouldn’t pass the threshold for an alarm. Whereas now, for those low confidence detections, we can—under certain circumstances, combining different types of metadata—say, “Let’s take a second look using a different technique” to be able to say, “Is this actually an alarm that a customer might care about?” To be able to combine those things together. And I think that’s just a large narrative around multimodal analytics. I think, for the most part, object detection is largely commoditized. If you look at what startups need to do to get a viable object detector today, whether it’s using the latest YOLO variant or whatever, most people can get going pretty quickly. I think where you end up having issues is exactly the areas that you’ve been asking me questions on, which is the edge cases: whether it’s extreme range, certain types of conditions where you might not have the training data. I think this is where customers end up having problems. So to go beyond that, I think… this is almost getting to a part, I won’t say exactly, where speech recognition got to.
It got to “good enough” very, very quickly, where essentially gains in training ASR models typically wasn’t worth the kind of exponential effort. So then everything shifted to natural language. It’s like, okay, “Well, the transcripts that I’m generating are pretty good. Now how can I do language-based tasks more effectively?” And there’s a bunch of NLP work that we’re doing in that area. And I think NLP has become a huge influence for us in vision, as well. I mean, this past CVPR, for example, everything was language plus vision, whether they’re jointly trained models or separately reasoning, using language to reason across vision-based models. This is something that we’ve been looking at for a while. So I would say two big trends in the computer vision space: one was unsupervised, semi-supervised learning. You’ve seen Meta, Google and other companies like that really show what’s possible at extreme scale. And then secondly is effectively using language not only to understand human intent, but also to interpret what the user is seeing. And like this is exactly the question you asked me before, where when you get an alert today, that event image pair is not terribly explainable, right? If you have a lot of training, you can look at that event and that image and say, “Okay, I kind of know what’s going on.” But being able to take that result and, in just plain language, explain what’s happening, not only helps us digest it better from a cognitive bandwidth standpoint, but it’s just way, way better to go, “Yes, I want to capture that. And I want that alert to happen again.” And I think this is where we’re really, really hyper-focused on using language as the glue to be able to essentially move away from logic-based rules and use the way we naturally think about problems to be able to capture future alerts. Which is also why, I mean, two sides of our business… you asked about alerting. The other side is forensic and search. We truly believe that everything we’re doing in search, which is heavily NLP-based and NLP plus vision-based, can help us bridge the gap to help users actually create new alerts that they can look for proactively. Lukas: Sorry, I think you need to give me another real world example of what this forensic search looks like. Why am I doing this and how does it work? Jehan: Okay. So, today forensic capabilities in a video management system—leaving aside alerts—I know something happened. Now I’ve got to figure out why it happened or where a particular person is. Now I fall back to using my search engine, essentially, in a video system. Lukas: Sorry, I think you need to make this even simpler. Like, why am I doing this? Someone broke into my school and I wanna… why is this hard? I would think you’d just sort of look at the video feed and see what’s going on. I’m sure that’s a stupid, naive interpretation. Jehan: So a couple of different reasons. A very, very simple retail use case: loss prevention. Something has gone missing off the shelf, for example, or someone stole something. I know that that happened. How do I trace it back to figure out who it was? When did it happen? For a school, for example, you know something terrible may be happening, where you’re reacting to what happened. The question is, “How do I know where that originated?” “How do I make it safer next time so that it doesn’t happen again?” And, “How do I gather information beyond a single camera?” I think this is the crux of the use case, actually. Many sites have multiple cameras. 
A lot of analytics today focus on single-camera events. So a single camera is going to generate an event for you. Now the person has moved on, they’re in a different camera. They’re in a different part of the site. This is where search really helps, particularly things like similarity-based search because now I can use that visual cue of who it was and search across all my cameras. This is really where they dip into the investigative space, where they saw something happening on a single camera, they take what they saw—whether it’s a person or vehicle—enter it into the system, and now the system will show you occurrences of that person or object across many, many different cameras. Now I can go deeper and understand, “Where is that person now, where was that person, and where is that person potentially going, so I can get ahead of the situation?” Lukas: And am I asking these questions in natural language then? Is that what the interface looks like? Jehan: That’s where we’re focusing a lot of our R&D effort today. Today, if I had to say, there’s two forms you can interrogate a search system. Visually—so essentially you can give it an image or an image crop of something, an object of interest, and systems respond to that. We can search the embedding space to be able to figure out if it’s a vehicle or a person or whatever else. There is structured search, where you’re looking for a particular attribute. I’m forming my query in the form of, like, “a man with a green shirt,” for example. What we’re doing right now—and we have been working on for a while and you’ll start seeing soon—is we want to make that as easy as searching for things on the internet, where you can essentially phrase that in natural language. We can use that natural language representation, then, to do more interesting things in terms of being able to bridge what’s in the vision domain with the language domain. Lukas: Wow, that’s really cool. It sounds almost like Star Trek or something. Jehan: But, I think on the consumer side it’s natural for us, right? It’s funny, like a lot of these verticals… Actually, I got similar comments where they’re like, “That seems like science fiction,” but if you think about consumer applications we are very used to doing that today as humans. But in a lot of these verticals—whether it’s healthcare or public safety or enterprise security—that’s just not how they do things, because the systems are just simply not sophisticated enough to be able to understand human intent and map human intent to structured data. One of the big problems that we worked on initially was… a lot of our knowledge base lies in relational databases. So then the question becomes, “How do I bridge what I’m seeing visually, or what I’m expressing in natural language, to structured data?” I mean, there’s a ton of very interesting work now using transformer-based models to be able to actually figure out, from an indexing standpoint, how do I actually query those structured data systems based on naturally what humans are saying. And we think that’s the future. I think making it easier for users to get information out of systems is really the bottleneck today. And many of the systems are too complex for users to actually figure out how… if I have to think which search to use, I’ve already lost valuable time. And in our business losing valuable time means, as you said at the start of the conversation, is a huge problem. Lukas: Well, it’s funny. I feel like I, obviously, when I’m talking to a friend I like using natural language. 
But when I’m engaging with the computer I feel like these natural language interfaces have gotten a bad reputation over the years for over-promising, and then just being frustrating when it’s not doing the thing you want. You don’t know what’s the next thing you should do. I guess, do you feel like the natural language understanding technology has gotten to the point where this is really feasible? I feel like I don’t actually engage, maybe ever, with an automated question and answering system that seems to work really well. Jehan: I think that’s actually a really, really good observation, and I would say I agree with you. I mean first impressions matter, right? If you use one of the voice assistants and it doesn’t work for you a couple of times most people will abandon it because they just assume that the coverage isn’t there. I still think it’s a huge challenge in general because the language space is so vast and users can interpret their intents in so many different ways. One thing we have to our advantage in what Motorola does is… our vocabulary is actually fairly narrow. If you think about safety and security, whether it’s public safety or enterprise security, you generally want to ask the same sort of things. The five W’s, for example. Like, you’re looking for a person or a vehicle and you’re describing the attributes. So I would say that the domain space of intent is narrower but it’s much deeper, so you need to perform really, really well on those very fine grain parts of the intent. So, for us natural language actually… the last couple of years of work that we’ve been doing has been very promising because not only can we constrain our models, if you look at a task like captioning for example. Captioning is a very difficult task to get right. You need a lot of data to be able to perform really, really well. If I think about something like captioning for us, we can really constrain the space that we’re looking for because we’re looking for those same things, and so we can really double down on what data sets we’re using and how we train those models where they can perform really well. That’s where I think, for us, language is very promising because of the type of problem space that we’re in. Lukas: That makes sense. A practical question I have given that you’re running all these models on live feeds of information—like, you actually really are running at scale and probably need really high uptime. What does your production environment look like? Is this another thing where you’re using third party tools or you’ve built something yourself? Jehan: The DevOps situation gets quite complex, especially when you’re thinking about data that’s running on premises as well as in the cloud. I think a lot of the ways we’re bridging that is, essentially, like I said at the start, we’re using central management. A lot of our cloud software runs pretty much the same way as any other vendor runs at scale, and we have redundancy and failover support for that. At the edge, it’s really about monitoring. So it’s making sure that we have good information about what our cameras are doing, the health of those cameras, being able to get the right metadata to understand model performance so that we know when something’s going wrong. We use a couple of different tools today that we’ve built because we are dealing with formats from our own cameras and data that’s highly proprietary. 
But I think we’re always looking for other tools where we can essentially centralize a lot of that monitoring capability because it is very complex. You have multiple pieces of hardware and software running together. So, it’s not just, “My cloud service went down.” It’s, “Okay, my camera malfunctioned and now things aren’t working there, in which case, everything downstream is not going to work, either.” I would say it’s a work in progress, in terms of making sure that we have good coverage as our solutions become more distributed. Lukas: Are things like data drift real issues for you that you look to detect? Jehan: Absolutely. Especially, I think, when it’s the first model of its type or it’s a new capability that we release. We spend a lot of time in house being able to test a lot of that stuff across as big a diverse and comprehensive data set as possible. But when it’s out in the field, we start seeing things like data drift happening, where it goes back to the question you asked before. As we learn from customers… that’s one way we can alleviate that, which is a customer might have an issue. We might recognize that being a common issue where we can address some of those. But we’re also proactively looking at our models and seeing, “How can we combat things like data drift?” For example, things like synthetic data have become a huge tool for us in certain areas where we’re either unable to collect real data or there’s sensitivity around collecting that sort of data where we simply don’t do it. How do we augment our models with those gaps that we have? And we work with a number of companies on this synthetic data front and we’re doing a lot of that in-house as well, where we’re trying to fill some of those gaps to make our models as generalizable as possible. But as you know, it’s definitely a work in progress in terms of keeping a handle, especially as the number of models kind of explodes. Lukas: Wow. You know, you’re one of the first people that I’ve talked to—maybe the Waymo head of research was the other one —but most people, I feel like, think of synthetic data as more of a theoretical thing that they’re sort of working on using in the future. It’s interesting to talk to someone that’s actually using synthetic data today to improve the models. I’m curious: I mean, if you want to name any vendors that are working well for you or techniques that worked well, I’m sure that would be useful to the people listening. Jehan: I mean, I can mention… so, we worked with a company called AI.Reverie. It actually got acquired by Meta not too long ago. So that was a very public vendor. There are a couple of others that we’re talking to right now that I probably can’t share the names just yet. But I think one of the areas… you’re right. There’s a lot of, I would say, misinformation and misunderstanding about how synthetic data is useful. I think there is one camp that believes that you can use purely synthetic data to train certain types of models. And that may be true, especially certain classification tasks you’d benefit a lot from essentially just purely using synthetic data to cover the domain gaps that you might have. I think where it gets tricky is when you have a non-trivial amount of real data and you want to be able to augment that with synthetic data. At that point… it’s really funny because, initially, we started working with vendors as dataset providers. 
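One very simple way to watch for the kind of data drift Jehan describes is to compare the distribution of some production signal, here detection confidence scores, against a reference window with a two-sample test. This is a generic sketch using SciPy, not Motorola's monitoring stack; the windows and distributions are synthetic.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, recent: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test between a reference window of detection
    confidences and the most recent window; a tiny p-value suggests the input
    distribution has shifted and the camera/model is worth a closer look."""
    stat, p_value = ks_2samp(reference, recent)
    return p_value < alpha

rng = np.random.default_rng(7)
reference = rng.beta(8, 2, size=5000)   # confidences when the scene looked normal
recent = rng.beta(4, 4, size=1000)      # e.g. a lens obstruction pushes confidences down
print(drifted(reference, recent))       # -> True in this toy example
```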
Essentially, you’d work with them, give requirements, and they would deliver a dataset to you, and you’d do all the training and experimentation in house. And then you realize very quickly that actually you need to do it end-to-end. And now you see a lot of companies actually doing that, where some of them are actually also selling tools for other companies that say, “Okay, you can generate your data. These are the knobs that we’re going to give you. And you can retrain your models and do that kind of in an iterative way.” And that’s really where we’ve landed today, where you can’t really think of synthetic data as something you get from a vendor. It really needs to be part of the machine learning development process. And for us actually, right now, where synthetic data is the most useful is testing and evaluation. Especially if you think about analytics that go beyond single object, and you’re thinking about groups of things. Whether they’re groups of cars or people, this is a very, very difficult thing to be able to collect data for. Even more, I won’t go into this now, but when you think about anomaly detection, especially of a high dimensional data, it becomes extremely difficult to test these things, because these events are so rare to begin with. Right? So you absolutely need to have synthetic data to be able to do that. And I think, for the most part, rather than training… though we’ve done some of that as well for certain types of use cases, particularly subclassifications, attribute classification for certain things. Because obviously you can basically have infinite ability to vary things like color and hue and texture and things like that. But testing is huge, especially for things like groups where you want certain patterns. You’re trying to mimic certain patterns. We went back to schools. When people are panicked, especially when you think about a building that has entrances and exits, there are very specific patterns of human motion that you’re not going to be able to collect. Hopefully you never will because those things hopefully don’t happen that often. And working with synthetic data and essentially incorporating it into our end-to-end pipeline is what we’re doing today so we can very quickly model out those scenarios. Lukas: Wow. I mean, it’s funny. I feel like synthetic data companies come to me for advice all the time. And I always feel like, you know, it’ll be very clear if your synthetic data is working to help a customer and then you’ll have a great business. But that part seems really hard to do. I would imagine modeling people in a panic is probably an unusual use case, but like incredibly important, and you better get it really right if you’re going to try to- Jehan: I think it’s the same thing, actually. I get a lot of startups coming to me and saying, “Hey, we would like to offer this to you.” Especially data startups at this point and MLOps startups focus, honestly…. and you’ve seen that if you… again, going back to the latest CVPR, there was a huge push on synthetic data at CVPR, including the release and commitment to a new open data set, for example, for synthetic data. I think the community, especially the academic community, they just simply don’t know what these companies are doing and where they’re focusing in terms of what outcomes they’re looking to enable. And I would say that’s the same advice I give a lot of the synthetic data companies is… these are my problems! 
So, for example, “I want to be able to get a lot of data about human attributes where I don’t want to collect real data, can you build photo realistic data that is good enough for me to be able to train a model,” for example. Or focus on a specific vertical. Verticals where it’s difficult to be able to collect real data. And I think that’s what we’re starting to see. Like if I look at a few different startups now, they’re really trying to find their niche. The other part is tooling. I think this is one area where I pushed very hard initially when we were looking at, and they simply weren’t ready to share their tools because they were building it in-house to be able to generate data sets for other customers. And I think that is one thing where, if you have a machine learning team—like, you’re not outsourcing your machine learning development, you’re actually doing it in-house—those end-to-end tools that you can incorporate into your machine learning development lifecycle are super important. That is when I think a lot of companies will start to see the value of things like synthetic data: when they can actually develop, train, and test iteratively to be able to see how it’s helping them. Lukas: I’m curious, as a startup founder: AI.Reverie was a customer of ours too. We saw that they got bought by Meta, and congrats to them. But did that experience make you a little more nervous about working with startups? Jehan: That’s a really good question, actually. I think about this all the time, because a lot of the startups that you talk to, they may be here and then not here in a couple of months. And so tying yourself very deeply to one ends up being problematic. I would say, just in general—or at least our team and what the company does—we like companies that focus on platform and tools and build things in a very modular way. Because not only does it help us really understand what value there is in that… like, for example, data visualization. Huge problem. You don’t want… like, before we had data scientists building all different types of visualization. Hard to share, hard to have a library of those things to be able to replicate. Same thing around data. If we tied ourselves to a company that was just generating data for us and then they went away, and we have no idea actually how to generate that data ourselves, I think that becomes problematic. I think companies that focus on tools and platform, where we understand what makes them great because they’re focusing on a very specific problem... but we also have the intuition behind what problem they’re solving so that we can start to invest in it more in-house. So synthetic data is a great example. I don’t think it can be a completely outsourced thing. Companies are going to go after little slices of the problem. I think if you’re really going to be all in on it, you have to invest in tools and technology on your end as well. And so I think just as a general rule of thumb, that’s what we try to look at companies that are a little bit more open in terms of how they’re building systems and have a good diversity of customers. So that we’re not the only ones relying on this one capability, so that they will tune the solution because we’re their biggest customer, for example. That becomes problematic as you know. Lukas: Are there any other kinds of common mistakes that a guy like me makes pitching a guy like you? Like when startups come to you, do you have any advice for them to, I guess, be a good vendor to Motorola? 
Jehan: I think, honestly—and it’s probably just a pet peeve of mine—but very few companies actually do any homework on what we do. So they’re pitching something which, if you just spent 30 seconds looking at what we’re doing, it probably didn’t make sense to pitch it. And I think the second part is: the volume of pitches is so high right now, especially in machine learning ops or computer vision, or whatever, NLP, that usually people like me who have to look at it… we have a very small amount of time to be able to actually make a decision. And I think when I look at it, when I try to make decisions, people is the number one thing. Like what’s the quality of people? I don’t care what problem you’re solving. Like, what’s the quality of the company? Where did they come from? What problems did they solve? That for me is number one. Number two is, “Did they take a little bit of time to pitch me on what they think is a good use of their technology for the problems that I’m solving?” I think those two things help me make decisions relatively quickly. And I think you can tell the founders or the companies who care when they maybe limit the number of people that they engage with. But when they do engage, they’ve done their homework and they kind of know that… they feel strongly that what they’re building could benefit the company. I mean, Alex is one example, Alex Ratner. I knew of Snorkel. We’d actually spent a bunch of development time using Snorkel. And I think that was a very easy relationship. I mean, and he himself reached out, which made it super easy because we were able to dive straight into “what problems are you solving with your company” and get engaged with them and say, “OK, now we know what path you’re on. We know that you’re someone we probably want to keep working with at some point.” And so that made it easy. Lukas: Awesome. Well, I’m sure that’s useful advice. Maybe we should end with our two questions that we always end with. The second to last one is: what’s a topic in machine learning that you think is underrated? Jehan: Oh, that’s a tough one because there are so many problems, I would say, out there. I still am a very, very strong believer in multitask learning and meta learning. I think you’ve seen the academic community go in that direction, but now we’re starting to see real results. I mean, I’ll point out one thing that—it’s not so recent but came out of Meta again—which was GrokNet, which is essentially using multitask learning to be able to basically do very accurate product recognition. We don’t do any of that. We’re not an e-commerce company at all. But one lesson there, at least that I learned, was being able to have a single model that does well across a variety of classification tasks and is trained and optimized jointly is something that is very important. And it used to always be that you’d choose a particular loss function that you’d care about. Now you use a multiplicity of different loss functions, some of which were not even intended for that particular task. So for example, GrokNet uses ArcFace, which was developed for face recognition, but they’re not using it with anything to do with face recognition. They’re essentially using that to be able to find the cosine similarity between different embeddings in a very varied space. We do the same thing. We started out having n different models. We want to get it down to some n minus x amount again to the point of managing different models.
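For readers unfamiliar with the ArcFace loss Jehan mentions in the GrokNet context, here is a compact PyTorch sketch of the core idea: embeddings and class weights are L2-normalized so the logits are cosine similarities, an angular margin is added to the target class, and the result goes through ordinary cross-entropy. The scale and margin values are the commonly cited defaults, used here purely for illustration; this is not Motorola's or Meta's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Cosine logits with an additive angular margin on the target class (ArcFace-style)."""
    def __init__(self, embed_dim: int, num_classes: int, scale: float = 30.0, margin: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.scale, self.margin = scale, margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized embeddings and normalized class weights.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = torch.cos(theta + self.margin)            # margin applied only to the true class
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.scale * (one_hot * target + (1 - one_hot) * cos)
        return F.cross_entropy(logits, labels)

# Toy usage: 128-d embeddings from some backbone, 10 identity/attribute classes.
head = ArcFaceHead(embed_dim=128, num_classes=10)
loss = head(torch.randn(4, 128), torch.tensor([0, 3, 3, 7]))
loss.backward()
```

In a multitask setup, a loss like this for one head would simply be summed (possibly weighted) with the losses of the other classification heads before the backward pass.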
So I would say multitask learning and meta learning, I think is still… people think it’s science fiction because, I think, academia-wise, a lot of people look at that and go, “I’ll come back to it in three or five years when it’s kind of ready to use.” But picking your spots, I think this is one area where I think it’s grossly underrated. The second I would say is user experience. I didn’t talk about it, but in addition to the artificial intelligence team, I lead the user experience research and design team here at Motorola. And I think those two things are critically essential to each other. It used to be that we would just develop algorithms in a vacuum, then go to the designers and go, “Hey, can you help me design some software around it?” And I think we don’t do that anymore. We start with a human problem, we try to design the experience, and then we try to figure out how the model can actually fit in that workflow. And I think any machine learning company should really, really consider that, especially when you go pitch your solutions to someone and you’re still trying to explain it to them after 30 minutes. I think then you probably need to tackle that. So I would say those are the two factors, for me at least. Lukas: Yeah, we hear that user experience thing over and over and over. It’s interesting how there’s a lot of movement back and forth between ML leaders and product leaders, I think, which is super cool. Jehan: Yeah. Lukas: I guess my final question is: when you look at going from speccing an ML application to deploying it live in production, where do you see the biggest bottleneck, or what’s the hardest part about getting a new model into production? Jehan: Yeah, so I’ll answer that specifically on the Motorola Solutions side, because that’s probably what your customers are interested in. So for us, especially if it involves edge hardware, the complexity is—as you know—synchronizing a software release cycle with a hardware release cycle. That is difficult because you have deadlines, you have supply chain issues and things like that. So having to do that. The second part is… I would say the easy part is getting a viable model out of research, if you will. Out of R&D. We are very well equipped to do that. We have great tools. Increasingly, our training infrastructure is automated. We do training in the cloud. We have on-prem compute through our distributed training methodologies. The problem is once we have a viable model… and typically there was a framework issue before, whereas now I think we use interchange formats like ONNX, we can get a model out that is somewhat framework independent. Second is, “How do I optimize for the platform?” If it’s NVIDIA, I might have to use something like TensorRT. If I’m using an AI SOC, I need to be able to use that company’s tools to not only optimize the model but also do post-training quantization, which is not trivial. Now you’ve got to see, “Did I lose anything in terms of my accuracy?” So getting a model that looks good on as much data as we have, then optimizing it for a particular platform, that part is complex because we have to deal with a bunch of different platforms. Once we’ve got there, I think the question of “is it good enough”—this is something that machine learning teams struggle with a lot. And I think if you distribute that task between your QA team—or a test team, for example—and the AI team, there are very big differences in opinion on what might be good enough.
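The export-then-verify step Jehan describes, getting a framework-independent model via ONNX and then checking that nothing silently degraded, can be sketched as below. The model is a small placeholder, the tolerance check is a toy, and the platform-specific steps (TensorRT, SoC toolchains, post-training quantization) are deliberately left out.

```python
import numpy as np
import torch
import onnxruntime as ort

# Placeholder model standing in for a trained backbone; 64x64 RGB input, 10 classes.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU(), torch.nn.Flatten(),
    torch.nn.Linear(8 * 62 * 62, 10),
).eval()
dummy = torch.randn(1, 3, 64, 64)

# Export to a framework-independent ONNX graph.
torch.onnx.export(model, dummy, "model.onnx", input_names=["input"], output_names=["logits"])

# Parity check: run the same input through PyTorch and ONNX Runtime and compare.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
with torch.no_grad():
    ref = model(dummy).numpy()
onnx_out = session.run(None, {"input": dummy.numpy()})[0]
print("max abs diff:", np.abs(ref - onnx_out).max())   # should be tiny before any quantization
```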
You might go do field testing, and you might test two particular scenes, only two fields of view, and say, “The model is doing terribly here.” The machine learning team will come back and say, “Well, our data set is way, way bigger than that. We trained on like a million images across many different scenes, and we think in general it performs well.” How are you going to do that when it’s statistically insignificant on the manual testing site? So I would say optimization and testing, especially if you’re trying to get these things out across multiple platforms. Lukas: And I guess fixing the problems with the real world tests are hard also. Jehan: Indeed, it is, insofar as finding candidate sites, are you doing it the right way? And how do you scale that? Again, which is why we’re trying to use things like synthetic data a little bit more effectively. And one change we made was our AI data team originally only served our machine learning team. Now the AI data team also serves our platform team and our test team as well, which has started to bridge that gap a little bit in terms of test coverage. Lukas: Awesome. Well, thanks so much for your time. This was really fun. Jehan: Oh, thanks Lukas. I really appreciated it, nice conversation. Lukas: Yeah, thank you.",11307 +Will Falcon — Making Lightning the Apple of ML,https://www.youtube.com/watch?v=KDrSNUb9zEA,2721,2022-09-15,"Will: Users are always going to tell you incremental things. They're always going to tell you they want this better. They're never going to tell you they want the iPhone. They're always going to tell you, can you make my Blackberry keyboard slide out instead, or whatever. Those inputs are going to usually improve the product, but they're not going to help you create a leapfrog product, right? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. Lukas: William Falcon started his career training to be a Navy SEAL before becoming an iOS developer and eventually the CEO of Lightning.AI, which makes PyTorch Lightning, a very successful ML framework and Lightning AI, which is an awesome website that calls itself the OS for machine learning that we're going to talk a lot about today. This is a super fun conversation and I hope you enjoy it. Lukas: I thought it might be fun to start with your background. We don't have a lot of people that went through Navy SEAL training on this podcast. So, could you tell us a little bit of your story on how you came to found Lightning? Will: Yeah, sure. So, I'm originally from Venezuela. So I don't know if people know that. I'm actually born and raised there. So English is my second language, which is why you'll hear me slip up today in a few things. Code does not care what language you speak, which is great. So I moved here when I was in my teens and then eventually ended up going to the US military and I went through SEAL training BUD/S. So I was there for a few years. If anyone knows BUD/S, I was in classes 272 and 277, which is great. And I came out injured, actually, and so, I basically got stashed in one of the SEAL teams that does a lot of intelligence work. It's a very interesting team, so I also happen to speak Arabic from just fun, I guess. And so, there's a lot of cool stuff that we were doing there. Will: And when it was time for me to go back into training — this is when we pulled out of Iraq in 2012 or 2013 — the Navy gave me an option to leave or become a pilot or something, and I chose to leave. 
Maybe if I'd seen Top Gun, I would've stayed as a pilot potentially, but it was a great time. And we did a lot of good work there and very happy about the time. I think it really set me up for success for everything I did afterwards. I didn't care about school until I left the military, turns out. Lukas: And then how did you get into machine learning? Will: So, I was at Columbia doing my undergrad, so around 2013, I want to say. And basically people started telling me about this machine learning thing and I wasn't super into math or any of this stuff back then. I started my degree as computer science. And for some reason, the CS part was fun, but it wasn't the most interesting part. I really gravitated towards math at some point. And I think if you were doing anything with statistics or math in 2013 and you were touching code, it's like impossible not to run into SVMs and random forest, and all this stuff. I remember taking my first neural networks class and they were like, ""Yeah, you got this image."" And we've all seen this EMNIST thing that Yann [LeCun] put together back in the day with a carousel music. And I was like, I don't know why this is useful. I don't see the value of this. Will: And then many, many years later, I ended up working with Yann as one of my Ph.D. advisors. So at some point in my undergrad, I went into finance because it was interesting, I guess. And I went there to try to use deep learning in the trading floor. And finance today is probably, maybe, not so allergic to deep learning anymore, but back then it was, because of all the observability problems. So, I didn't love that, and so, I went back to school, I got into computational neuroscience, and that's really where I learned about deep learning and got really into machine learning. So, really the science is trying to decode neural activity and trying to understand how the brain works. So I still care a lot about that. And a lot of my drive is really the pursuit of science, but I find that a lot of the tools are really limiting to enable science to advance and do what it needs to do. Lukas: But then what were you seeing when you started Lightning? What was the problem you were setting out to solve in the very beginning of it? Will: When I started Lightning, I was still at undergrad, so this is around 2015. I was doing my research and I wasn't like building Lightning for Lightning or anything like that. It was just my research code that I had internally. And what I was trying to optimize for was how do I try ideas as quickly as possible without having to rewrite the code over and over again, but in a way that doesn't limit me. Because as a researcher, the worst thing that you can do is you can adopt something and you spend six months going through research, and then suddenly the last few months you're blocked and you're like, ""Oh my God, I have to rewrite everything,"" and then it discredits all your results. So flexibility was the number one thing that I cared about. So, that's a lot of what I was solving. Will: And over the years, really, I did open source until 2019, so it took about four or five years to get there. What I did during that time was just try so many different ideas. So my first research was, like I said, neuroscience, a lot of that was using GANs and VAEs. Then after that, I moved into NLP when I started my Ph.D. So [Kyunghyu] Cho is one of the main authors on the seq2seq and attention paper. 
So my first thing was to implement attention from scratch and a seq2seq network and all this stuff and learned, which is very rough if you guys have ever tried this, it's not trivial. I know Lukas has implemented this a bunch of times. Lukas: I've tried to do it once and I agree with you, it's non-trivial. Maybe it's not quite as hard, daunting as it seems at first. I don't know. I guess there's probably less resources when you did it. Will: Yeah. I mean, back then, you're writing everything yourself. Nowadays, there's attention heads and all this stuff you can plug in. But there, you're calculating your own stuff, and then PyTorch didn't support certain things, you're blocked and it was really confusing. So it was rough. And then we took that and then we started working on complex. So Cho also introduced GRU units. So we started working on complex GRUs, and the idea there was to help eliminate gradients from exploding or zeroing out. And so, complex numbers can help you do that, especially for audio, with some normalization techniques and all that. But complex numbers is not something that PyTorch supported, really until a year ago. So little old Ph.D. me, I'm sitting there and I'm like, ""Okay. I have to implement this whole complex number library,"" which I did and it's open source. Will: It's super slow, don't use that. Use the PyTorch one— it's better now. But it's willing to do what it takes, I guess, to get the thing done. But through all those learnings, eventually I ended up in computer vision and self-supervised research. I think if you work with Yann, there's no way you don't do self-supervised learning at some point. And so, I fell into it and this is 2019, I think, before it blew up. Well, before the world found out about it. People have been doing this for many years. Will: And so, all of that stress tested Lightning. And so, that was pretty flexible by the time that it got open source, I knew you could do a lot of this stuff. And then when I joined FAIR, it was a lot of like, ""Oh, can we use it for this or that?"" I'm like, ""Yes, of course, you can. Let me show you how."" And it just took forever to explain all the possible ways they could use it. And today I think it's obvious that it can work for pretty much anything, but it wasn't back then, and we still learn as we go sometimes and someone finds that it's not flexible for something and we fix it and we move on. But it's a long process. It's taken a lot of years to get here. Lukas: So when you go back to 2015, was PyTorch actually in use at that time? It was just Torch, right? I'm trying to remember what years these things came out. But certainly an unusual choice to build on top of PyTorch in 2015, if that's even possible. How did that happen? Will: Well, so my original version wasn't on top of PyTorch. So I had actually started with Theano. So basically what happened, I was using Theano and sklearn mostly. So I think I did what everyone does, where they take the model and they add the .fit to it. And then you start building off of that. And so, that was my original version and that was Theano. Have you worked on Theano? I don't know when you started, Lukas. Lukas: I think I might have touched Theano, but very little. I think I was using Keras on top of Theano, if that dates me. Will: Yeah. No, for sure. So I got really annoyed at it. I think it was great to show proof of concepts, for sure. So I started using Keras immediately and I think that helped me unblock a lot of stuff. 
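Since Will mentions implementing attention from scratch back when there were no built-in attention layers, here is the standard scaled dot-product attention in a few lines of PyTorch, roughly the computation a seq2seq-with-attention decoder repeats at every step. This is a generic sketch, not Will's original code.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q: (..., Lq, d); k, v: (..., Lk, d). Returns attended values and attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (..., Lq, Lk)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

# Toy usage: batch of 2, 3 query positions attending over 5 source positions, 64-d vectors.
q = torch.randn(2, 3, 64)
k = torch.randn(2, 5, 64)
v = torch.randn(2, 5, 64)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape, attn.shape)    # torch.Size([2, 3, 64]) torch.Size([2, 3, 5])
```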
But at some point, you end up running into limitations and I'm sure that's changed, but back then, that was true. And so, that happened and that's when I was like, ""Fine. I guess I have to go and get into TensorFlow."" I was trying to avoid it. And so, my first version actually was built on top of TensorFlow. But the second that PyTorch came out, which was a few years later, I rewrote it all in PyTorch, and mostly because it just felt more mathematical. I could see the math, it was easier. Whereas in TensorFlow, you had this duplicate layer where it was a meta language on top of the thing, which again, that's changed since then, but back then, that's kinda the world we lived in. So, it was very experimental. Torch back then was very hard to work with. Oh, sorry. It was easy, but installing things like that was really difficult. Lukas: That's really interesting. So were you at all inspired by the way Keras did things? Or do you feel like your Lightning was in contrast to parts of Keras? How did you think about that? Because I feel like Lightning plays a similar role to PyTorch as Keras plays to TensorFlow. Do you feel like that's too simple or wrong? Will: Yeah. I mean, I think when I first released Lightning and we put it on the Torch thing, I called it the Keras for PyTorch because at a high level it looked like it, but it really wasn't. So I may be the cause of this confusion, unfortunately. But like I just said, I used Theano, I used Keras, I used TensorFlow, I used sklearn. So a lot of my inspiration obviously comes from a lot of these things. Before I got into machine learning though, I was an iPhone developer. So I worked on iOS for a long time. And so, a lot of these ideas that people bring in as callbacks and all these things are actually ideas that have been introduced in objective since the 70s, 80s. Will: So, if you work on mobile, if you work on web, you've been exposed to these ideas. So I would say a lot of my inspiration really was I think the API simplicity like .fit kind of thing came from most likely a sklearn, I would say. And then I think that a lot of the callback and things like that... I was actually very opposed to callbacks. It turns out a lot of the hook names, and even if you see the way I've named things, a lot of them are inspired by Objective-C and these super long names. Actually, you told me you had started with Objective-C, so I'm sure you know what I'm talking about, but it's like super long syntax names. Lukas: I'm a little surprised you like Objective-C. I feel like most people, they hate it. And I think one of the reasons people tend to hate Objective-C is the verbosity, but it sounds like you see the sense in it. Will: Yeah. I mean, the verbosity makes it don't have to think about it. I hate when names are so short and you're like, ""What do you mean by this?"" Objective-C is like, ""You did load on this and that and that."" You're like, ""That makes sense. I read this whole thing."" Will: I think all of them did inspire me. And I would say, I think something I really liked about Keras was the feedback that you get. So, the summary tables and all of that, that's inspired by Keras as well. So I would say it's a combination of a lot of things, but I would say most of the things that I've really thought about really are driven in that fundamental like Objective-C worlds and that iOS world. Will: And in fact, if you look at Lightning apps, now the new abstractions that we put into Lightning, a lot of them are similar to that. 
So, they have a lot more elements of that. So, I think over the years, things have evolved. But no, I think Lightning's taken its own soul and its own thing, and it's started to become kind of its own paradigm that I hope does become a standard in the industry, and I hope that it does inspire a lot of other people, especially in their APIs and how they write things, because I do think it works at scale. So I'm not offended if people grab the APIs and do something with them, because it means that at the very least we standardize ML, which is a win for everyone. Lukas: What's a part of the Lightning API that you feel super proud of that you feel like was different than what was around when you built it? Will: Yeah. So I would say the main two things in Lightning are the LightningModule and the Trainer. And I think those are the two that everyone uses and those two together allow you to abstract most of it away. And so, I think that's really what I'm proud of. I think I'm proud of... the Trainer really, I think, has changed a lot and it's starting to become a standard across many other things outside of Lightning because it is a good API. And I think it's just the simplicity of it. The ability to see what's happening, change things and just see magic happen. Will: And I would say probably honestly the new stuff that we just released with LightningWork, LightningFlow and Lightning Apps, it's taken us a few years to really think about this and figure out, how do we take those ideas from building models and how do you generalize that to building full end-to-end ML workflows, research workflows, production pipelines, all that stuff? And that's just not an easy thing to do. So we wanted to do it in a way where it felt Lightning. It has the spirit and the DNA of Lightning and you feel like you're using Lightning when you're using it. So I'm very proud of that. And that's something that was a team effort. All of this, by the way, has been a team effort collectively. I think I've seeded some ideas, but there's no way that we would've been here at all without the community and the team here at Lightning specifically. Lukas: Yeah. I totally want to talk about the Lightning launch that you just came out with recently. I'm super impressed by what you did there, but I guess I'm curious before we go into that, I remember a moment where I think PyTorch had something called Ignite, I think, that was really similar to Lightning or at least the PyTorch team thought it was similar to Lightning. I'm kind of curious— you were actually working at Facebook, I think... were you working at Facebook at the same time that Facebook was also making a somewhat competitive piece of software to you? And was that awkward? Did it feel competitive at the time? Will: So, two things. One, Ignite is not done by PyTorch and it's not a Facebook product. It is a third-party product where all they're doing is hosting the docs for it. So it's not actually built by Facebook or PyTorch. It just seems that way because of the way the docs have been structured. So, that's the first thing. The second thing is I was a researcher and a student and I was literally trying to write papers, not build software for machine learning. So I wasn't sitting around using tools and looking around at stuff, so I had no idea that they were around. I had no idea what else was around. The ones I've used are the only ones I literally knew about.
You've been in research, I'm sure there's a ton of stuff that you're like, ""Oh, that's cool,"" but never used it because I don't care, because I'm doing my research. Lukas: Totally. Will: So, I think it's a pretty normal thing for researchers to be pretty narrowly focused. And I think it wasn't until it got launched that people like Alfredo and everyone else were like, ""Oh my God, it's kind of like this."" I was like, ""Oh interesting. What is that thing?"" And then I look at it, I'm like, ""Oh, I guess it's kind of like this, but it's got its own DNA."" So it's not surprising, though, it happens in research. You have people who are working on something in parallel because something has happened that unblocks it, so it's going to trigger similar ideas in a lot of people. But when they come out at the end, they're going to be very different things. My analogy is always like, if you and I are like, ""Hey, let's paint the face of a person,"" and, say, I describe the face. I bet you and I are going to paint it differently even though we're trying to do the same thing. Lukas: I guess what caused you to actually start a company around Lightning? What was that journey like? Will: Very interesting because the first adopter of Lightning was Facebook and that got us enterprise features very quickly. I mean, I was really annoyed because I was literally trying to do my Ph.D. I was like, we have this thing internally called Workplace where people message each other and I kept getting pinged by the Facebook team, not at FAIR, the actual people building all the fun stuff. And then I didn't check this thing. We tried exchanging emails. I'm not the best at emails, so I hadn't checked this thing literally for four months. And then my manager came in and was like, ""Dude, you have to check Workplace."" I was like, ""Why?"" And then it's these Facebook teams being like, ""Hey, we want to use your thing."" I'm like, ""Dude, it's a Ph.D. project, why would you want to do that?"" And they're like, ""No, it's okay. We'll help you make it better."" I was like, ""Fine."" Will: And so, they took it and started working on it and we've been super tight with the team since then. But then it was crazy because then big companies started using it immediately. It was like someone would submit a PR and they're like, ""Hey, can you fix this?"" I'm like, ""No, I'm not doing, I don't know, FFT research or whatever you're doing. I don't want to fix that."" And they're like, ""But I'm at Bloomberg."" I'm like, ""That's cool. All right. I guess I should help you out."" Will: And so then, as a developer, that's the best thing, you're like, ""Cool. My stuff is being used for real. That's great."" So I think when I had hundreds of these, I was like, ""Okay. Well, these people are really struggling with this bigger problem, which is what we just launched, so let's go ahead and really solve that problem in a meaningful way,"" but it turned out that you couldn't do it alone and you needed a ton of money and people and so on. And so that's how we ended up here. Lukas: And I guess what year was that? Was that 2019? Will: Yeah, that was summer of 2019. And then I left Facebook in December 2019. So started the company January 2020, two months before COVID. So, Lukas, you built a few companies, you've been successful and I'm sure you know how hard it is to build during COVID. Lukas: Well, I mean actually here we are summer 2022. How big is your company? Will: Yeah. Good question.
So, we're about 60 people now, all over the world, and I think we've mostly clustered around New York, San Francisco and London, and then we have people everywhere else. I will say one thing that I'm really proud of in the company, again, I'm not from the US, I'm not from Silicon Valley, so I think that that's been the DNA of the company now. We have a ton of people from 20 different countries and it's amazing because everyone speaks all these languages. And it's pretty cool, you feel it's pretty international. So I think for a New York startup, this is great. It's exactly what you want, that melting pot. Lukas: That's awesome. What has the experience been like to go from a researcher and developer to suddenly running a really significantly large company? Do you find time to think, to write code on your own still? Will: Yeah, a good question. Maybe I'll ask you this. Don't you feel like building a company is kind of doing research? There are a lot of parallels, no? Lukas: I do think there's some parallels, but you go first. Tell me what you think the parallels are. Will: So, what are you doing in research? So you have a hypothesis and you're proven wrong most of the time and you got to just try something quickly and then move on to the next thing and try idea after idea and so on, until you find something that works and then you dig into it. That's no different than a company. The difference is you have to do it through people, which is really hard. So it's not just a solo person building, and I think people forget this. It's like, if you want to build anything meaningful, you have to have a team, you cannot do it alone. At this point, I have to tell you, I just said that Lightning took about five years to go live. If I'd been working with this team, I probably could have gotten there in a year because it's a lot faster when you have really smart people around you and you're working together. Will: So I don't love this notion of the solo whatever-who-did-whatever — that doesn't work, guys, I don't buy that. So it's been amazing. You have to build a company through people and that's really hard to do. So people management, taking a vision and getting everyone to go towards that same vision where they don't even know what the output's going to look like. That's really hard, because you're asking 60 people to just suspend disbelief and say, ""You know what, fine, we're going for it. And when we get there, we'll see what it is."" And so, you have to trade that off a lot as a leader. And I think honestly spending the first six years in the military, even though I didn't do all the SEAL training that everyone does and didn't become a full SEAL, but the stuff that I did go through — especially leading small teams in training and at the SEAL team — actually did translate really well. It's like, how do you get an aggressive bunch of people to go towards a goal really fast when you have no information and you have limited resources? It's like, perfect. Lukas: Well, that's really cool. Tell me more about that. I'm really curious. What are some of the things that you learned about leadership in the military that you applied to running your company? Will: Yeah. I mean, if you show up to BUD/S as a junior officer — so I was 20 when I started SEAL training — I got put in charge of about a 300-person class. That's crazy. And so, you have to be accountable for everything, all their gear, where they are, and it's all 18, 19 year olds.
They're all getting in trouble out in town, they're all doing really silly things. So, you're having to deal with a ton of people issues. And you're 20, you're learning on the job. And then you show up to your first SEAL team and then you're put in charge of a team, and those guys have been there for 30, 40 years. They're so much better than you in every possible way. So if you show up trying to teach, you feel like, ""Hey, I'm here, big, bad boss. I'm going to do whatever,"" you're done. Will: That's not how it works. So I think specifically, I can't speak for the whole military, but I can say in the SEAL team and special operations, you're taught to lead from the front. So, as an officer, you are supposed to be the fastest runner or the best swimmer, all of that because you're always leading from the front, so I still carry that here. So, that's why I'm not coding all the time right now, but I do want the team to be at a specific level and I can get there because I can push the team. And so, I think it's a lot about that and some mentality that if I'm going through that door, I'm going first and I'm going to be there first always. And so, a lot of those lessons carry over. So there are bunch of civilian terms for this, whatever leadership is called, but that's ingrained in me since I was 20, basically. Lukas: That's really interesting. Do you think there's any really striking differences about managing a company of mostly highly technical people distributed around the world that you were surprised by that's different than leading a team of 18 and 19-year-olds? Will: Yeah, for sure. So in the military, it's very dictatorial, I guess. You're like, you make a decision and that's it. There's no question, no one questions or anything like that. You, of course, take people's input and everyone has that. But at the end of the day, you say something and it just happens. And there's no second guessing, whatever. In the civilian world, oh my God, there's questions and this and that and blah. And so, you have to really learn how to live in that world. So it's fascinating. Will: I think the few years that I spent in finance were the best middle ground. And I actually think a lot of veterans have a hard time adjusting to the civilian world probably for this reason, because the way you do things in the military is just so different. So you can't approach people that way, you have to learn the EQ. So in finance, it's kind of this hybrid super aggressive ground, but you still have to learn how to talk to people. And so, if any veterans are watching this, I would urge you to go to finance first so you can learn a soft landing and then go into tech because in tech you're dealing with designers and creatives and people are very different there. Lukas: That's awesome. Do you think you have any role to play — this is a total aside, I'm just curious if you have any thoughts on this — but sometimes I feel like at least in Silicon Valley there's often a lot of friction between military and tech working together. Do you think about that at all? Do you hope that there's military applications of Lightning, and do you think you can play a translation role? Or how do you think about that? Will: Yeah. Look, I think that specifically AI in the military or like... everyone is ""autonomous weapons! blah!"" That's what everyone jumps to, and yes, that is an extreme use of it, for sure, and that's not a use that I want to support. I don't think any of us want to support that. 
Especially having been in some situations where it's pretty clear that you don't want to enable more of that. But I think what people don't understand is that some of these tools can be used in also positive ways. There are ways where you could, for example, I don't know, I don't even want to get into it because people are going to judge all the parts, but there's ways so you can use it still in a good way: translation, right? You're in the field and you're meeting someone in a new village and you can't speak to them. Will: How do you do that? A lot of what the military has done during the war has been around winning hearts and minds in Afghanistan and Iraq. And that's really making those connections with villagers and trying to understand what's happens and trying to rebuild countries and so on. And I think that a lot of AI could actually facilitate a lot of these things, right? Casualties. When you have casualties, you need to call something out, maybe the person can't speak, so translating or something. So there's some great applications of it, but it's like anything. Like, yes, can the internet be used to find your long lost family? Of course, it can, but can it be used to traffic people? Yes, it can. So what are you going to do, shut it down? You know it's hard. There's not a simple answer. Lukas: Alright. So tell me about the new Lightning website. What's the best way to talk about it, Lightning the operating system? I'm curious to know how you conceived of it and how you built it. It's such an impressive launch with some very impressive demos. I'd love to know about the process and your vision here. Will: Yeah, for sure. So if you go to Lightning.ai today, you're going to see the new homepage for the Lightning community. So, I think the first thing to note is PyTorch Lightning has grown. The project is no longer called PyTorch Lightning, it's called Lightning now. Because when it was just PyTorch Lightning, it let you do one thing which is build models. So that's cool except that when you build that model, there's a ton of other stuff you have to do around it. You need to wrangle data and you have feature stores. You need to manage experiments. You need to do a lot of the stuff that you guys are doing: analyze it, understand what's going on. So, what we are now enabling the framework to do — so the framework is now Lightning — it enables you to build models, still you can do that. But now, when you want to build research workflows or production pipelines, you can now do that within the framework as well in the Lightning way. Will: And what we really want to do is allow people to stitch together the best tools in class. So we're really thinking about it as the glue for machine learning. So if I want to use Weights & Biases ""feature X"" with this other thing, I should be able to, right? So really, I think you should think about us like Apple. We're really introducing the iPhone-equivalent so that people can build apps on there — so they can build their own apps and publish them — but these apps are extremely complex workflows, they're not just demos or something like that. These are actual end-to-end production workflows or research workflows that can run in distributed cloud environments, but they stitch together the best-in-class tools. So Lightning AI today is really the page for where these apps get published. 
So if you're trying to start a new machine learning project, you can go there, find something similar to what you're working on, run it on your infrastructure very quickly within minutes and then change the code and off you go. Will: And so, I think some of the things that I'm super excited about — and you and I have chatted a lot about this — is what are some of those integrations we can do with partners? And so, what are some of the great tools that we can enable, for example, from Weights & Biases there so that people can embed into their apps in really cool ways that probably are not possible today, right? And so it's really around that. I think I'd like to partner with every single framework and every single tool out there to help them shine and really provide the best capabilities of what they have for the community. So, I think that's what we're shooting for. Lukas: And I guess how long has this been in the works? It seems like a pretty different vision, as I understand it, than PyTorch Lightning. When it first came out, how did you come to it? And was this always on your mind ever since you started the company? Will: Yeah, for sure. So, that was definitely the vision from day one. It's really hard to build up front, so you really have to do the work for it, but that's how PyTorch Lightning had already started to do a lot of this. I mean, we were some of the first early partners there. So, when PyTorch Lightning first launched, we have to go back to 2019, I don't know, May, June, whenever it was: you had frameworks that were running. And if you wanted to watch your experiments or something, it was really hard to do, you had to integrate something. And so, you had TensorBoard, I think you guys were probably live by then, I assume. And it was like, no one knew about these things because they weren't there, they weren't easy to use. And so, one of the first things we did was... I personally used TensorBoard, so I used it back then and I was like, ""Hey. You know what, I don't want to start it out myself. Let me just let this thing do it."" Will: And so, we started integrating that in there and then very quickly your users started coming by and saying, ""Hey, can we add Weights & Biases?"" and so on. And then we came up with these abstractions and then suddenly people could use it implicitly. And that was amazing because it started to stitch together tools. So, that vision started back then already. And then if you look at the accelerators, so we wrote this API called Accelerate, which lets you train on different hardware, this is back in summer 2020, and it powers all of Lightning, but that's what it is. It allows you to go between CPUs and GPUs and TPUs. And I think we're the first framework to actually let you do that seamlessly. Will: So PyTorch supported XLA for TPUs and supported GPUs, but you have to rewrite your code over and over again. So we introduced for the first time, the ability to go between GPU and TPU, just like that. And that really changed the game. And so, that's been amazing because that was an integration. So it started to become a platform back then. And so, for me was, ""Okay, how can we do more of this?"" Except that in the model, you're very limited to just these kind of things. But when you start talking about feature stores and deployments and all that stuff, you need something a little bit higher level. 
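To make the hardware switching Will describes concrete, here is a minimal, hypothetical sketch of the LightningModule-plus-Trainer pattern in recent PyTorch Lightning. The argument names reflect current releases and may differ from the 2020-era API he is recalling; the model and data are toy placeholders.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.net(x), y)
        self.log('train_loss', loss)  # picked up by whatever logger is attached (TensorBoard, W&B, ...)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# toy dataset: 256 random 32-dimensional examples with binary labels
dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
loader = DataLoader(dataset, batch_size=32)

# the hardware choice lives in the Trainer, not in the model code
trainer = pl.Trainer(max_epochs=1, accelerator='cpu')  # or accelerator='gpu', devices=2, or accelerator='tpu'
trainer.fit(LitClassifier(), loader)

The point of the abstraction is that the LightningModule above never mentions devices at all; only the Trainer line changes when you move between CPU, GPU, and TPU.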
Again, I'm lazy and I hate learning new things, so I was like, ""Okay, how do we make it just as easy as Lightning so that if you know PyTorch Lightning, you already know how to build production systems?"" And so that's kind of what we released. And the hard part was getting it to exactly be like Lightning. What is that DNA? What does the user experience feel like? Lukas: I'm curious how you think about product development and customer feedback. It felt like you created a lot from your own vision. How much of what you do is informed by your gut, and how much of it is coming from a user saying like, ""Hey X, Y, Z, could you make something that does this or this or this""? What's your product development process look like? Will: Yeah. So I think I'm probably the worst person to ask this because I don't care what anyone is doing. I legitimately don't. I don't look at what people are doing. I don't care. We're going to do what we're going to do, and we're going to do things that I think are interesting. And so, we're going to basically form a thesis around something that we want to do and we'll watch the behavior and the users, of course. But if you only talk to users... We speak to users all the time, by the way, so it's not about that. We take their feedback in. But users are always going to tell you incremental things. They're always going to tell you they want this better. They're never going to tell you they want the iPhone. They're always going to tell you, ""Can you make my Blackberry keyboard slide out instead?"" or whatever. Will: So you have to have just a different mentality there where you take things with a grain of salt and you do take their inputs, but it's really... those inputs are going to usually improve the product, but they're not going to help you create a leapfrog product. And so, that's really where, again, I just don't care what people are working on, I'm just going to do what I think should be done for machine learning and that's what we build next. And sometimes we're wrong and sometimes we're right. Lukas: Do you think it's important to hire people with a machine learning background to do the kind of work that you do? Or do you look for people with more like an operational or engineering or database background? Will: So, I guess, first and foremost, I care that people are creative, driven and interesting in some way, like they just have interest and they're not just the same cookie cutter persona. So, that's the first thing. Then after that, yes, I want you to be good at your thing, whatever your thing is. Now, specifically machine learning, it's nice to have, please, by all means, I hope you know what you're doing with it. If you are on the Lightning team, you 1,000% need to know. And every single person on the Lightning team is a Ph.D. or came out of a Ph.D. program so they're all experts in this stuff. But everyone else who's around that, I just want you to be really good at your thing. And I don't care how you got that knowledge. I don't care. Remember, I didn't go... Well, I eventually went to fancy schools, but for most of my life I hadn't. And so, I didn't really care about that. So, I think machine learning is not necessarily a deal breaker, it just depends on your particular role. Now, I could be wrong... Lukas: How does the Lightning team fit into the broader company team? What's the distinction there? Will: So, the Lightning team works on all the open source stuff. And then we have people who work on all the closed source stuff.
So when you run Lightning apps on your own, you're using all the free stuff. When you run it on the cloud, that's when you use some private proprietary stuff. So you can take a Lightning app, you fork the clone, even models and all that stuff, you run them locally. But if you want to run on the cloud, you say [?] cloud. And then that stuff is now being built by the other people who are not Lightning teams people. And these people are infrastructure people, they're database people, they're from all sorts of walks of life, I guess. And I think that diversity is always better in this world because there's just a lot of unknowns. And you and I both know this, that ML is evolving, we just don't know what's going to need to be built next. So we have to have a research hat on a little bit. Lukas: Are there top of mind applications that you hope get built on your Lightning platform right away? What are the next things that you're excited about? Will: So, top of mind right now is a few of these key partners that we've been working with for a long time like you guys, where we want to make the tools just more widely adopted and bring more visibility to them and have the ability for people to mix and match and more. So it's really about these immediate partners. Some of these include cloud providers, some of these include the hardware makers and so on. It's people that we've had really good relationships with for a long time. So it's about enabling those tools to work first. Will: In terms of capabilities, I do think that we do want to make sure that people have a really good way, I don't know, to do inferencing, for example. So we're partnering with the cloud providers to do that like SageMaker team and so on. And then I think for people who want to do anything with data, so would love to partner with like the Snowflakes and the Databricks of the world to enable these things as well. And then there's other labeling things that people are starting to do as well. So, I don't know if you guys are doing anything there, but obviously happy to partner in any of these. I think it's those things that are immediately around the model development part. There's a lot more that we can do, but we really want to focus on this part first. Lukas: Would you ever work with frameworks that aren't PyTorch? Do you like a scikit integration or XGBoost or anything like that? Is that within scope? Will: Yeah, for sure. It's crazy, people use Lightning for all sorts of stuff, but people have actually ran sklearn in Lightning. I don't even know how they did that. Lukas: That's awesome. Will: I was like, ""How are you doing this?"" Yeah, honestly, I love to integrate all the frameworks. I'm long PyTorch in general, but I don't have anything against TensorFlow and JAX and Keras or any of these things. So I think any partnerships there, we're happy to obviously work with and enable the tools as well. Again, I think that we've really evolved from where we were before to a point where we're saying, ""Okay, now that we're able to support a lot more than we could"" — before it's just a function of having bandwidth, right — ""now we can support a lot more than we could, we want to do that and make sure we welcome these partners as well."" So yeah, we're happy to work with any framework. Lukas: I'm just curious. Why are you long PyTorch over the long-term? Will: I think that a lot of these frameworks have converged in functionality, I guess. I haven't gone back and used TensorFlow and I think it's probably changed quite a bit. 
We've just done so much work already in PyTorch that I think we're just excited to continue improving that user experience. I think if Google wanted to partner with these other ones, we'd be happy to do that as well. But I believe that you can't really do everything well, and so, it's a function of having focus as a company as well. And as for anything in particular in PyTorch, I think it's really become the standard for research and also production nowadays. And I firmly believe that that team has done a really good job at continuing to push the boundaries. So I think that the energy, the way that the team thinks about things and how it's approached even doing production workloads and inference, it's just very unique and different. I don't know. I like unique and different thinking, I guess. So I gravitate towards that. Lukas: I guess one of the things that I struggle with as we scale our company and our team... we hire all these really creative, smart people that have slightly different points of view and vision and stuff, and keeping things aligned and keeping consistency always feels like a lot of work to me. I'm curious how you've dealt with that, if that's been an issue for you as you scaled up to 60 people. Will: Yeah. I think you always want to take everyone's inputs into account, but you also want to be opinionated, and that's the difference. And I think that when everyone just says whatever, and then they'll do whatever they want, then you end up with something that isn't really cohesive. And so, to some extent, you got to be a little bit the bad guy and just say, ""Hey. You know what, cool, I get it, but we're going to go this way. And that's just the way it is."" And it's a lot of these micro decisions that get made. It's not just me, it's people on the team where I encourage them to be opinionated. And so, it's the same philosophy that we have for Lightning. It's like, ""Cool. You don't like subclassing things? Cool. Sounds good. Go use something else. We don't care. This is the way that we think it should be built and that's fine."" Lukas: Well, look, we always end with two questions and I want to make sure we get to them. So, the second to the last question is, if you had a little more time on your hands, or I guess, if you had time to work on something else in ML research broadly, what would it be? Will: Yeah. So, if I were back to doing just research right now, I'm pretty sure I would've continued on the self-supervised learning route. I still track that work. I believe that — we published a paper about this a year ago, so I'm going to talk about that — but I believe that a lot of the things that have been pushed into self-supervised learning, a lot of those advancements, are actually not necessarily being driven by the methods, like negative sample this versus that. I think it's actually being driven by the transforms. And so, the paper that we published a while back, I would've continued on this line is my answer, I guess. The paper that we published a while back showed that we could achieve very similar performance to something like SimCLR using a plain VAE without any of the fancy tricks. And actually we removed one of the terms of the ELBO loss. And why we could do that is because we took the SimCLR transforms and used them. Will: But then the way that we generated the negative samples was using the transforms and then you reconstruct the original. And so, that actually created a really good learning signal.
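The transforms Will is crediting here are the standard SimCLR-style image augmentations. A rough sketch of that pipeline with torchvision, producing two augmented views of the same image as a positive pair (an illustration of the general recipe, not the exact setup from the paper he mentions):

import torch
from torchvision import transforms

# Heavy, stochastic augmentations in the SimCLR style; most of the learning signal
# Will describes comes from how aggressive these are.
simclr_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

def two_views(pil_image):
    # each call samples different random augmentations, so the two outputs
    # are different views of the same underlying image
    return simclr_transform(pil_image), simclr_transform(pil_image)

In the VAE variant Will describes, views like these stand in for the usual contrastive machinery: the model sees a transformed image and is trained to reconstruct the original, so the augmentations themselves carry most of the supervision.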
And what that showed me and showed our group as well, was that it's not about the fancy negative sampling algorithm and whatever thing you're doing with, I don't know, information theory of whatever thing you're coming up with. It's that I think that we're just embedding most of these things into the transforms and the transforms are actually pulling the weight. Which actually is in line with what the data scientists have been saying forever, it's about the data. It is about the data. Will: So it turns out that we've just pushed all that knowledge into transforms now for images specifically. So, I'm a little bit sad about that, but at a minimum, I think I would probably continue on that route exploring, how can I reduce the complexity of these algorithms? I don't want these tricks. I don't want these weird learning rate schedulers and all this stuff. I want the super simple VAE loss or something super basic that I know why it works and I can pinpoint exactly why it's doing what it's doing. And I think self-supervised learning has lost its way in that most of these papers are like ""brand new paper that does this!"" and it's like, ""Oh, they changed this one tiny term."" And it's like, ""Come on guys."" Lukas: Interesting. Well, my last question is when you look at people that are trying to make machine learning work for real stuff, like companies like Facebook or Bloomberg or anyone, and they're going from like, here's an idea of something we wanted to apply machine learning to deployed and working in production, where do you see the biggest bottleneck right now in summer 2022? Will: It's like that meme where it's like expectation and reality. I think that's what we see all the time. Lukas: Yeah. Why though? Will: Yeah. I think there's a lot of them like where it's just unknown, like the thing is so new that you stress test it in a production system and things break and you're like, ""Ah, my chatbot is racist,"" or something. You're like, ""Yeah. Well, no one's employed a chatbot before."" So, of course, you're going to learn that lesson. So, there's a lot of new unknowns that we're discovering. But I think a lot of it is the explosion of tooling that's out there and the lack of a standard on how to use that tooling together. So, I think that's a lot of what's holding us back today. I think there are many ways to solve that problem. I think that we're obviously taking a stab at that with the things that we've just introduced. And so, I honestly think that's a big part of it. Now, I believe that that's only a part of it. Will: I think that the other ones are this fragmentation. Everyone wants you to go from this, to that, to that, to that, and then use this ONNX thing and then with this thing and that, and it's just like, if we just have a standard and everyone works together, we can actually do well. I honestly think there's a super unhealthy weird competitive thing in ML like, guys, this is a massive market. There's a ton of people who are going to pay for this thing. It's not about one or the other tool, everyone is using all the tools together. So this unhealthy competition thing is actually causing a lot of these problems. 
I think actually if the community worked together more and we had better communication and collaboration between frameworks and between open source projects and tools like you guys, then things would be a lot easier because we'd be speaking to each other and then some random engineer sitting in Facebook doesn't have to waste six months being like, ""Man, if they just did this one thing, it could have been so much easier."" Lukas: Awesome. Well, I hope we can find some ways to work together. Will: Just think of that one. Just think of that person. Just be like, ""I will get you your career back. Don't worry. That's the goal."" Lukas: Alright. If you're listening, we're rooting for you. We'll make it work for you. Alright. Thanks, Will. Real pleasure. Good talk. Will: Yeah. Thanks for having me. This is super fun. And by the way, I'm a big fan of everything you guys are doing. So I appreciate everything you've done for the ML community as well. Lukas: Awesome, likewise.",8975 +Aaron Colak — ML and NLP in Experience Management,https://www.youtube.com/watch?v=3vEj4IlAqao,3000,2022-08-26,"Aaron: When you do sentiment analysis on this huge set of industries — companies we are trying to help to listen to their customers and employees — out of the box models don't work. So how do you customize them? You can obviously go through customizing models to specific use cases, brands, or industries, but a much more powerful way is combining the power of these language models and letting the customers override the specific lexicons or rules and whatnot. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Lukas: Aaron Colak leads a team of machine learning engineers, scientists, and linguists at Qualtrics. Qualtrics is a super big company that you might not have heard of that takes large language models and applies them to real world B2B use cases. This is a really interesting interview and I hope you enjoy it. Aaron, thanks so much for doing this, I really appreciate it. I kind of thought a good place to start would be Qualtrics, and it happens to be a company that I know well, because I've worked with Qualtrics for a long time and a real star employee of mine, John Le, ended up over there. And so I know Qualtrics well, but I'm thinking a lot of our listeners will not know even what Qualtrics does. So maybe you can just start by saying what does Qualtrics the company do, and then tell us how machine learning fits into Qualtrics. Aaron: Sure. I think that would be a good starting point. Like many B2B companies, sometimes it’s a little bit not obvious when you just use the technical terms to explain what the company does. But when we get to the bottom of it, it actually is pretty cool, so let me start. We as individuals — human beings — use products, consume services every day. So every single usage of a product or a brand — being a customer brand, even working for a company or an organization, being an employee — from the individual's perspective is an experience. Every single transaction is, for us, an experience. Going to a restaurant, taking a flight, taking an interview, working somewhere for some time, is an experience. So what Qualtrics does is give our customers tools to design these experiences, track those experiences, analyze those experiences, and act on those experiences. So there's four pillars to experience management. Overall, we are trying to help our customers manage, find, detect, and fill those experience gaps. 
Our company roots started in a somehow interesting domain, which is surveys. Especially in social sciences and business schools, doing surveys is the primary tool to do research. And our founders were trying to help their father to build a survey tool, and then it just exploded! It just became phenomenally successful. And then at the next iteration, at the next level, those users of surveys — students and researchers — came back into the enterprise setting with different problems: trying to understand what customers think. And that became our... we shifted a little bit towards market research and understanding customers, employees, and whatnot. And eventually, over the last few years, we've been working on this new category, which is experience management. So we would like to think of ourselves as founders and leaders in that space. Lukas: Could you tell us a little bit about the scale of Qualtrics? Aaron: Sure. That’s a great question. I like to think about scale in a couple of different ways. One of them is obviously the number of customers we have, and we have tens of thousands of customers. But I would like to think about other ways, in terms of, actually, a number of touch points going through our systems — the number of experiences we are analyzing — and also the diversity of the type of experiences and channels from which you can actually track and improve experiences. So in that respect, our systems analyze millions of experiences every day, and we have different channels and modalities: social media channels, or other input channels such as surveys, tech surveys, web surveys, mobile, and social media call centers. There are various modalities we are actually collecting and analyzing this data from. That is kind of, for me, one other aspect of scaling. Lukas: And so what are the really important ML applications to Qualtrics? Is it somehow processing those surveys in different ways? What's really at the core? Aaron: Absolutely. So as I mentioned, surveys are one of the most important channels, but it’s not the only one. Even in surveys, there’s definitely a big part of the data that’s being structured. And in my opinion, experience data is most easily, or most naturally, expressed in unstructured data. So one of the most obvious places where ML comes into the equation is analyzing the unstructured part of the surveys, such as open text questions: analyzing sentiment, emotion, effort, and finding what folks are talking about— employees or customers. What is the individual team-specific sentiment or emotion? This is the sort of stuff where obviously machine learning is utilized mostly, but obviously there are other aspects. For example, the minute you go into a call center, then comes conversation, which is a totally different beast. Lukas: Can you make this more concrete for me though? Tell me one common survey that people might not realize happens. Aaron: Right. I think most folks would probably know about CSAT and NPS. NPS stands for “net promoter score,” and CSAT stands for “customer satisfaction score.” These are well-established industry standards, where businesses basically ask their questions to customers, and those surveys can be structured — or the input channel can be structured, depending on how you score and what you express. And you might be actually just filling in your experience. “How was your experience, Lukas?” You might be just filling it with, “Oh, the price was great, but the service was not good,” right? 
And you might imagine big enterprises— when they try to listen to their customers, there might be literally, possibly, practically infinitely many topics customers might be thinking about. So how do you detect and act on this? So this is where in comes machine learning, specifically NLP. Detecting what your customers are talking about — is it the price, is it the service quality, is it the taste of the food — and then, “What is the topic-level sentiment on this? What’s the emotion on this?” Things like that. I hope that made it a bit more concrete. Lukas: Yeah. I just want to make it even a little more concrete for people that don't know. We actually measure NPS religiously at Weights and Biases. People sometimes complain that we're asking too much, but we really love to ask NPS and that's a measure of ""would you recommend this product to a friend"" on, I think it's like a scale of one to 10, right? And then you take the nines and tens and subtract the one through six or zero through six, those are the low ones or the detractors, and the high ones are the promoters. And it’s sort of a sense of “are people liking your product?” And then, I think, is CSAT the one where it’s like, “How would you feel if the service went away? Would you be disappointed or not?” Is that right? Aaron: Could be. I think depending on the context, a good way to think about CSAT and NPS is a little bit like in the following way: CSAT has to be focused on transactional experiences, whereas NPS is more about relation — your relation to that service provider or company or brand — taking everything into account. “How is your overall experience?” Lukas: And sorry, and CSAT is like one moment in time or one thing that you did? Aaron: Yeah. Transactional experience. Yes. Lukas: And so what's a typical question that a Qualtrics customer would have about all the NPS data they're collecting? It sounds like they maybe want to know what are the themes of things that people are unhappy about. Aaron: Correct, exactly. I think this is the most canonical use case. Everybody is obviously… every company cares about customers. Right? And this is really one of our biggest motivations because… as you are familiar from the big tech companies— some of them are phenomenally successful with their customer obsession. So how do you enable the rest of the world who doesn’t have an army of data scientists and engineers to listen to their customers? And employees too, not just customers. Our tool can be used in these different settings to listen to different personas from their experience perspective. Lukas: And now, when you analyze this freeform NPS survey data, just to use a specific example, do you come at it with a specific set of categories that you're interested in, or do you kind of draw themes out with clustering or something like that? Aaron: That is a great question. Yes. So there’s two types of experiences — speaking of experience — with our products if you want to do analytics and open-ended questions. This is where you usually find, especially, emerging new stuff. You can go with what we call “industry-specific libraries,” where our domain experts and industry specialists collect and create these libraries of topics. But as we know, the world is changing fast, especially in certain industries. So how do you stay on top of things? How do you find emerging new stuff? How do you make sure things are not under your radar? This is where ML comes into play. 
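As a quick aside, the NPS arithmetic Lukas sketches above is simple enough to write down directly: promoters are the 9s and 10s, detractors are 0 through 6, and the score is the promoter percentage minus the detractor percentage. A toy version:

def nps(scores):
    # scores: answers on a 0-10 scale to the would-you-recommend question
    promoters = sum(s >= 9 for s in scores)
    detractors = sum(s <= 6 for s in scores)
    return 100 * (promoters - detractors) / len(scores)

print(nps([10, 9, 9, 8, 7, 6, 3, 10]))  # 4 promoters, 2 detractors out of 8 -> 25.0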
We actually have deployed machine learning — things like topic detection, key phrase detection, things like that — and we surface that. We've brought the temporality dimension into the equation, of course. And then that's where they dig in: either curated, ready-to-consume libraries, or finding topics on the fly. Lukas: And now, I feel like natural language processing in the past few years has moved faster than maybe any other field in ML. And suddenly you have this explosion of large language models, which are quite evocative in terms of text generation. But, I'm always wondering how much this affects businesses like Qualtrics. Do you use large language models a lot, and if so, where do you use them and not use them and how are you thinking about that? Aaron: Right. So I would like to first mention a disclaimer: obviously I'm a big fan of ML and deep learning. Having been at the University of Toronto during my grad school when all these things were happening, you can't escape that gravity obviously. Lukas: For sure. Aaron: But also having been in ML for so long, our approach is pretty much going after what our customers' needs dictate. As you know — as you've covered in this blog many times — large language models, contextual models, cross-lingual models: they're game changing. Right? If you use them the right way… if you identify the right situations for them, they can be really powerful. And we do. We do use large language and cross-lingual models a lot. But you might be surprised that we also use rule-based systems a lot. I think rule-based systems, heuristics, they enable you to not only move fast and be scrappy, but they also enable quite a bit of customization. Because when you use large language models, when you do sentiment analysis on this huge set of industries — companies we are trying to help to listen to their customers and employees — out of the box models don't work. So how do you customize them? You can obviously go through customizing models to specific use cases, brands, or industries, but a much more powerful way is combining the power of these language models and letting the customers override the specific lexicons or rules and whatnot. So yeah, we have the full spectrum, starting from classical linguistic analysis, lexicons, all the way to the bleeding edge deep learning models. Language models. We use the full spectrum, and I think the future includes a hybrid — for us, at least — for the foreseeable future. Lukas: And when you use these large language models, where are you getting them? Are you using Hugging Face or are you using some of the APIs out there, like OpenAI or Amazon or others, like how do you think about that? Do you feel like it's important for you to train your own? Aaron: Yes, because for a lot of the problems we are looking at, just taking a model, training it and tuning it to the domain… there are obviously problems that can be solved with just simple tuning or domain adaptation. There is a large spectrum of problems where it's not sufficient for us. There's also the whole aspect of scale: when you operate at this scale, we are dealing with millions of short and long text conversations, so we also need to care about scale. So model compression is a big area for us as well that we're focusing on. We do use, pretty commonly, the XLM-RoBERTa type of models. We experiment with all the latest and greatest stuff that's coming our way, and we pick the right model for the right setup.
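One hypothetical way to picture the hybrid Aaron describes: a general-purpose model proposes a sentiment label, and a customer-specific lexicon or rule gets the final say for phrases that mean something different in that customer's domain. The function, lexicon, and example below are invented for illustration; they are not Qualtrics code.

def score_sentiment(text, model_predict, customer_lexicon):
    # model_predict: any callable returning 'positive', 'negative', or 'neutral'
    label = model_predict(text)
    # customer-specific overrides win over the generic model prediction
    for phrase, forced_label in customer_lexicon.items():
        if phrase in text.lower():
            return forced_label
    return label

# e.g. for an airline, mentions of a sick bag are operational detail, not negative sentiment
lexicon = {'sick bag': 'neutral'}
print(score_sentiment('The sick bag was missing from my seat', lambda t: 'negative', lexicon))  # neutral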
And if need be, we’re also customizing them in terms of the downstream application and in terms of combining the language models with other modalities and whatnot. And how much customization we do in the model — how much tuning, where do we phrase — it all depends on the exact, specific use case. Lukas: Do you feel like the advances in NLP has changed your approach to machine learning over the last few years? Aaron: Absolutely. I think — again — it’s a powerful, powerful tool. It’s not a silver bullet for everything, but — especially in an organization like us, who tackles multilingual data in low-resource languages — it has been a big, powerful tool. Lukas: Interesting. So how do you use it for low-resource languages? Aaron: I have to be careful here, because when I say “low-resource language,” I might not be using the exact academic sense of “low-resource,” because that seems to be a bit of a moving target these days. Lukas: Sure. Aaron: What I mean is — from our perspective — obviously, every business has a target depending on where they operate — and what kind of products they’re developing — and has business priorities in terms of languages they want to handle. And the amount of English data brought into the raw, labeled or customer feedback data, which we used to train these models… obviously, for English… we have disproportionately more English data than, say, even some common European languages. And some of the success we got from these models — just purely based on zero-shot learning — was sometimes more than enough for getting a POC out there and then iterating on it. Because more often than not — I don’t know, having been in this field multiple times — getting training data, labeling, can always be an issue. Like, sometimes it’s the “chicken/egg” problem. If I have one product out there, then I will have customers doing some edits for me, or giving feedback and data. But how do I get that if I don’t even have a model working? So zero-shot is, in a way, game-changing in that respect. Because it enables us to — as long as you set the expectations right — get something out there, make it a win-win situation for you and your customers, while data starts pouring in and you iterate from there: from the feedback, from the implicit or explicit labels that come to your system. In that respect, it’s been game changing. Obviously, when you keep practically… In the past, if you wanted support for any NLP system in X languages or K many languages, you probably need K many language experts, K many sub-teams working on those. But right now, again, it’s not a silver bullet for every use case. But for a lot of the use cases, simple investments and small data sets can go a long way. There’s also, actually, changing the paradigm in a different way too, Lukas, in my opinion. In the past, when you were doing an NLP project, every single project — whether it’s the same project in a different language or a different functionality in the same language — would require, almost from a data perspective, getting to ground zero. You start from ground zero. You cannot share a data set, pretty much. But these pre-trained language models combined with cross-linguality enables us to basically do a lot of new ideas, new projects, for a fixed amount of budget, just because the amount of data you need to tune to a new feature, a new language is just significantly smaller. 
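A rough sketch of the zero-shot, cross-lingual pattern Aaron is describing: fine-tune a multilingual encoder of the XLM-RoBERTa family on English labels, then check how it does on a handful of French examples before any French labels exist. The checkpoint name, labels, and toy test set below are placeholders, not Qualtrics' actual setup, and the English fine-tuning step is elided.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = 'xlm-roberta-base'  # a multilingual encoder of the type Aaron mentions
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# ... fine-tune on English (text, label) pairs here, e.g. with transformers.Trainer ...

# zero-shot check: French examples the model never saw labels for
french_test = [('Le service était excellent', 1), ('Le prix est beaucoup trop élevé', 0)]
model.eval()
correct = 0
for text, label in french_test:
    inputs = tokenizer(text, return_tensors='pt', truncation=True)
    with torch.no_grad():
        pred = model(**inputs).logits.argmax(dim=-1).item()
    correct += int(pred == label)
print(f'zero-shot French accuracy: {correct / len(french_test):.2f}')

If that number looks close enough to the English metric, that can be the V0 Aaron mentions, shipped while labeled feedback in the new language starts to accumulate.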
And that shift — for us, and for many others I know in our industry, in technology and the NLP space — has been changing the way they look at data. Lukas: Interesting. So how does this actually work? So you'd say you have some French language survey results. When you say ""zero-shot,"" do you mean that you take some kind of embedding and put it into some comparable spaces to English, or how do you actually approach the rarer languages practically? Aaron: Let's take an artificial problem, like a text classification. I want to classify it into A, B, C — its sentiment and whatnot. We actually shared this in our blog post, but basically you train for English. You may have more data, or labeled data, on it. But for French, if you have limited data, the least you can do is use that data for testing ""how is my zero-shot performance?"" Right? And sometimes we have little data sets that are in these languages and we say like, ""Okay, from a test perspective, it looks good enough,"" or even ""pretty satisfactory, coming close to the English performance"" or whatever is the performance metric we want to hit. And if it's not, you kind of get an idea of how your model is doing. That might be enough for you to get a V0 out there and start collecting data from a feedback perspective, because our systems — not all, but some of them — allow our customers to give feedback in terms of our predictions. And then, that basically… compared to how we would do these kinds of things… not even five years ago, you had to go and start from scratch in French. Lukas: Yeah. Yeah, it's really impressive. It just works reasonably well right out of the gate, typically. Aaron: Usually. But again, there are problems sometimes. Not all of them. Nevertheless, very, very useful. Lukas: Do you end up training separate models for each customer then? Are you fine tuning new models on every single customer's data? And if you have like thousands of customers, does that create a huge logistical problem for you? Aaron: Yeah, that's a great question. First of all, a couple of years ago, even if that were the right thing it would not be feasible. Practically, very challenging. But these days, fortunately, hyperscalers provide a lot of functionality with various services in terms of multiple model endpoints and asking for their predictions and whatnot, even if you have a model endpoint for every single language or task you have. But we do a combination. We try to use multitasking as often as we can. It's just a powerful tool. Obviously, more often than not, you basically get a linear — proportionally linear — return on your combination of the tasks. So instead of having N models, you have basically one model doing N tasks if possible or applicable. Obviously, from a model life cycle management perspective it generates its own challenges as well. And that needs to be taken into account in the long-term design, because then you're coupling models. If model requirements change and you need to update one, or one task, do you really need to update the other task? Lukas: And how do you think about what a task is here? So I'm imagining if you have- Aaron: Let's give a concrete example. Lukas: Yeah. Aaron: Let's say you're trying to predict sentiment on a given text. And you might imagine sentiment and other related, more nuanced dimensions of human experience — emotion and whatnot — or other things you want to predict about this text, like intent.
So, same input, and you basically can predict, with a single model, the emotion, the intent, and the sentiment at the same time. So these are individual tasks. Right? It doesn't have to be a single prediction: it can be a classification task combined with a sequence-to-sequence task. Doesn't matter. So— Lukas: But if you're doing this on behalf of two customers, do you consider that to be a single task, or does each customer's data look like a different task? Aaron: Yeah, yeah. Yeah. Sorry. That reminds me— I didn't quite answer your first question. We don't do… As I mentioned, for customer specific needs we tend to think in terms of giving a customer the full power to customize or override the behavior of the model. That comes through using various enrichments we do to the text on top of whatever target task. You can also do all the linguistic enrichments, and you can combine these linguistic enrichments with rules and other heuristics to actually override the model behavior. There are some initiatives going on… I'm not at liberty to discuss them here, but we are thinking of enabling customization on the ML level, as well. This is not implemented yet. Lukas: I see. But I guess, does allowing the individual customers to kind of customize what the output's doing, I guess that means there's a single underlying model that's feeding the customers and then they sort of override it? Aaron: Right, right, right. That's a good point. I think I should have made that distinction upfront. So we have two types of models: one of them is what we call a ""universal model"" — these are models that work for all customers in the same way irrespective of who's sending the data. But we also have customer specific models. For example, we have this tool called Predict iQ, where you can use experience data — what we call X-data — or operational data — O-data — or a combination of those to build predictor models, starting with churn prediction. Actually, that's one of John's products! So the product, Predict iQ, by definition is customer specific because you as the customer bring your own data, define what your variables are, or let our system kind of do AutoML and automatically build, train, optimize, and deploy a model for you. And as new data comes in, you can actually… in a certain fashion, you can predict the… So, customer-specific models and universal models. But I understood your question — initially, by mistake — as, ""What about these universal models? How do you customize them?"" So our approach to that is letting the customer override behaviors so it's not completely ML-based. But we can envision a future where we can totally, completely let customers give feedback and continue to train these models. This is more just thinking right now— there are no concrete plans or commitments on that one. Lukas: And now I guess processing language data is pretty different from predicting churn. When you think about predicting churn from survey results or something like that, does deep learning have any role to play or do you go to more traditional models for that kind of tabular data? Aaron: That's a great question, Lukas. The interesting thing is that people have been trying to extend some of the ideas that came from transformers to tabular data, as well: I think there are some variations developed specifically for tabular data.
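Going back to the multitask setup Aaron sketched, one shared encoder with separate heads so a single model predicts sentiment, emotion, and intent from the same input, here is a minimal, hypothetical PyTorch version. The dimensions, task names, and the stand-in linear encoder are all illustrative; in practice the encoder would be a pretrained transformer.

import torch
from torch import nn

class MultiTaskTextModel(nn.Module):
    def __init__(self, feature_dim=300, hidden_dim=768, n_sentiment=3, n_emotion=6, n_intent=10):
        super().__init__()
        # shared representation; a linear layer stands in for a pretrained encoder
        self.encoder = nn.Linear(feature_dim, hidden_dim)
        self.sentiment_head = nn.Linear(hidden_dim, n_sentiment)
        self.emotion_head = nn.Linear(hidden_dim, n_emotion)
        self.intent_head = nn.Linear(hidden_dim, n_intent)

    def forward(self, features):
        h = torch.relu(self.encoder(features))
        # same representation, one prediction per task
        return {
            'sentiment': self.sentiment_head(h),
            'emotion': self.emotion_head(h),
            'intent': self.intent_head(h),
        }

model = MultiTaskTextModel()
outputs = model(torch.randn(4, 300))  # a batch of 4 pre-featurized texts
print({name: logits.shape for name, logits in outputs.items()})

The coupling Aaron warns about shows up exactly here: retraining this one model for a new emotion taxonomy also touches the sentiment and intent heads, so updating one task is never fully isolated from the others.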
But I’m not convinced by that yet: the concept of pre-training — which is where most of the power for these language models comes from, I think — doesn’t quite apply, at least not in our setting. Everybody’s customer data, everybody’s transaction data is different. The semantics of it are different. That being said, there is a future we are investing towards where we’ll be able to, hopefully, give our customers options to schematize their data: to map it to a shared schema. And when that happens, obviously, things change a little bit. Then you can actually envision a future where learnings can translate — global patterns can translate — from one data set to another. Lukas: And what would this mean to map to a schema? Does this mean kind of standardizing the customer names and standardizing the definition of churn or something else? Aaron: Right. Not quite that. More like, for example, think about the following: say you have a question about NPS, like, “How likely are you to recommend company or product X to your friends?” You can imagine this can be expressed in many, many different lexical and semantic forms, and in different languages. So capturing that question — identifying, “Hey, this is the same question. This is an age question. This is an income question.” Right? Basically, you’re structuring the data in that way and identifying the fields and numerical values and ranges and whatnot. Then data becomes mappable, data becomes transferable, learnings can become transferable. Going back to the Predict iQ problem, yes: we use deep learning there as well, mostly canonical techniques. But not surprisingly with tabular data, tree-based models are pretty successful, even if they’re not necessarily always the best in terms of performance metrics. It’s just much easier to work with them because they have this natural way of dealing with missing data, combining categorical and continuous features, numerical features, and whatnot. So there are lots of ways. Lukas: Got it, got it. Aaron: One way of still using deep learning with tabular data is that, even in tabular data, certain questions are still open-ended text. Lukas: Totally. Aaron: Yep. Lukas: So I guess, what does your infrastructure look like? Have you standardized everyone on a single machine learning platform? Have you standardized the frameworks that people use or is it open-ended? Aaron: I think in many ways we are far ahead of past experiences, or of colleagues that I know, when we discuss the ML ecosystem and the state of affairs. But we have one advantage in a way. Our ML platform development efforts are relatively new, so we leverage a lot of the functionality that these days is coming from hyperscalers. A couple of years ago, building an ML platform was a very big deal. Being able to support different hardware, different workflows, different personas was — even for a small ML team — a big, big deal. These days, we are using hyperscalers, obviously: moving a lot of the heavy lifting to hyperscaler functionality. And most of the work we do is basically harmonizing our data and our workflows and expressing them operationally in terms of our platform, which is based on tools like SageMaker and whatnot. But yeah, our current ML training, serving, and scientist workbenches are all standardized, yet this is a fast-moving field. There are a lot of new systems. There are small and big players. You mix and match and try to leverage the best of both worlds. Lukas: Where do you feel like there are gaps?
If somebody was listening and was thinking about making a company to do some ML tooling, where would you guide them? Or if our product team wanted to roll out something new, what would you appreciate the most? Aaron: First of all, having done ML for almost 20 years now, one of the things I most appreciate is that it’s kind of a dream come true seeing such a big ecosystem: things like, for example, experiment tracking. For those of us who went through grad school tracking things by hand — I always make this analogy: I used to work in the computational biology field, and a lot of my collaborators and peers had these really nicely organized experiment notebooks. And I’m like, I will never be successful in this field because I’m never that organized in files and whatnot. But as computer scientists, we can still write scripts to organize, no matter how messy things are. But when I work today, I see tools like Weights and Biases, and other tools for model performance monitoring. Lukas: Do you have a favorite performance monitoring tool? Is there one actually that you use? Aaron: Not yet. We’ve actually narrowed it down to a couple of things, but we are still actively working on it. But my point here is that one of the things I strongly recommend to my team is leveraging productivity-boosting tools for things such as experiment tracking and reproducibility. For me, the biggest gap is still in CI/CD, for a couple of reasons. I don’t think it’s as well understood as other parts of the ML lifecycle. And there are different personas involved: data scientists, ML scientists, ML engineers, application engineers. That is a complex problem. I think the nature of the problem is complex. Solving that problem seems really, really big. Some actors — including yourself — are doing some really interesting things out there, so I’m eagerly observing this field. I think some of the core infrastructure problems — in terms of the ability to support different hardware combinations, scale, and all that stuff — have been solved to a large degree these days. To me, the next level is really winning these scientist and MLE personas by building something they can connect to. Because I see the adoption of these tools as still — this being a new industry — not quite there. So, yeah, CI/CD: I guess I would put that at the top there. Also, depending on your application area, monitoring. And the third I would probably put — depending, again, on your industry and application focus area — monitoring as well, but with less focus on the operational perspective and more on the fairness and bias perspective. These are obviously good things to pay attention to, and there are also — these days — societal and legal reasons, and regulations, that make you pay more attention to these kinds of systems. Lukas: Is there any tooling that you're using or have built to help with fairness or any kind of explainability at Qualtrics? Aaron: Right. We are definitely looking at that, because we know our systems are being used in context-sensitive applications. I don’t want to disclose any specific names, but one thing that’s happening in this space is that testing AI systems — developing testing frameworks, behavioral measurement frameworks for them — has fortunately taken off lately. So there are tools from academia — both papers and tools — as well as from industry. I haven’t seen industry adopting it as much. I might be wrong there, to be frank. There’s still, I think, some way to go there, but this is becoming… We are definitely looking at it.
We are looking at our models, how they’re behaving under certain… whether it be gender bias, other social identity biases. But bias can creep up in many ways, so this is going to be a continuous effort in our agenda. Lukas: How do you think about building your team? I guess, how is your team structured now, and what skill sets do you look for? Aaron: Well, let me start with what my team does. We deal with basically all things ML, from building the ML platform, to working on building dataset ontologies and libraries for NLP applications and beyond. And then I have two applied science teams. One of them is really focusing on NLP analytics applications. As I mentioned, we discussed a lot about surveys, but surveys are basically solicited feedback. At Qualtrics — you would be surprised — when we look at it in terms of the volume of data, much more text data is actually coming from other channels: social media and customer support applications. So for analytics, obviously we have a large team, and we have made certain investments in this area to really grow and expand our footprint. The other team we have is focusing more on infusing ML into all our product lines. And that includes more canonical applications — from time series modeling, anomaly detection, recommendation systems, path optimization, yield optimization — to fraud detection and things like that. And this depends on the business. For analytics, obviously, we’re looking for subject-matter expertise first. Right? Though, as much as we love and use deep learning — and we hire deep learning experts — we are also looking to make sure we are linguistically grounded. So we have a lot of linguistics experts who are actually building very deep linguistic analysis packages to make sure we marry the systems in the right way to solve our customers’ needs. On the more canonical problems, we try to have a diverse team from a skill set perspective: deep learning, statistics, engineering. This field requires really going fast, solving problems, and not necessarily always coming up with a new approach or bleeding edge algorithm. Lukas: Interesting. And I guess, do you think there's anything that specifically makes somebody successful at Qualtrics? Or on your team outside of the normal things that a company would look for? Aaron: Sure. So Qualtrics ML is… We’ve been very focused on… Over the years, as our vision evolved, data and ML have become more and more central to our business. Because listening to these different channels — different data models — and understanding, predicting, and being able to give actionable data to our customers, to me, boils down to deep data skills. And we have a lot of ways to leverage this different data, marrying experience data with operational data, and we are uniquely positioned to do that. So somebody who… Or maybe I should answer it the other way around: “Why consider Qualtrics if you’re working, for example, in the ML field?” I think it has a lot to do with the uniqueness of the problems and the data sets. When we look at the spectrum of problems, yes, we do have a lot of problems you can immediately relate to, but there are also a lot of problems that are very unique, that don’t exist in other fields, or data sets that don’t exist in other places. Obviously the volume is there, the volume of the data we’re tackling. But — and I’m speaking particularly from experience for my team and myself — developing ML applications in a B2C setting is very different from a B2B setting.
You’re dealing with very different customer personas. Supporting the ML cycle — when you think about the model life cycle and the ability to refresh models — the implications of that are much more permanent at enterprise scale. Switching out one model just because there’s a new, better result is not as fast… or you don’t have as many degrees of freedom as you would have in a B2C setting. I might be overgeneralizing here, but that’s my own personal experience. What else… I guess, from being B2B to working on very unique data sets and problems, it’s not always easy to go look up a paper and implement the technique; you need to really be creative and synthesize new solutions, come up with new ways to look at the data. Lukas: I guess, looking at your career, when you came out of school, out of academia, you went to Amazon, right? What was the biggest surprise — I mean, that's always kind of a shock for people, I think, going from research to practical applications — what was the biggest surprise for you? Aaron: The biggest surprise for me… well, actually, 2022 marks exactly ten years since I was doing a graduate internship. That was my first industry experience, and I was very academically oriented. It was the usual thing: writing papers, going to conferences, and trying to look out for the next step, which is a post-doc. For personal reasons, instead of spending a research summer, I took an industry internship. And instead of ending up in a research lab — because of visa problems — I ended up in a more industry application-type lab, and I tremendously enjoyed it. Because, up until that point, I always thought I enjoyed really tackling tough technical, scientific, open problems. But this is when I had the realization that I just like solving problems. And, staying in the ML space — which is still an applied research field — your every day, pretty much, is filled with some uncertainty. You still have that everyday unknown and excitement about what’s going to happen. “Will this experiment work?” You’re always continuously thinking, creating, looking at the data — everything changes. It never gets monotonous. For me, it never has. And then — I was making this joke to my team members, but to some degree it’s true — it’s the fastest return on your work. I have written my fair share of papers, but here I see things going to production. It just gives a different sense of accomplishment, solving problems. And even today, when I look at what we are doing at Qualtrics — helping our customers solve their customer problems — I think it’s an amazing feeling. And that just keeps me going and focused on staying with problems, even though sometimes the data or technical problems might be very challenging. And you are — I know it sounds a bit cheesy — but you are changing the world. I just had this terrible experience with one of my home projects, and I feel like I’ve sent 30 emails, nobody even bothering to reply. I’d like to think that in my world, one day, somebody at the other end of that will have a tool and will think, “Hey, Aaron’s experience is broken. Let’s surface it, let’s do something about it.” This is the future we are building for Qualtrics. Lukas: That's great. I've definitely come to believe that listening to customer experience survey results is one of the real keys to building a successful company. So I actually totally identify with that. It's a good segue actually to our last two questions.
And the second to last one is basically: what is something that you think is understudied in machine learning? Something that, if you had more time or if you were back in academia, you would spend some time looking into because you think it would be valuable? Aaron: I wouldn’t say… perhaps not understudied, but one thing I’m waiting on to make a big splash is causality. I must admit, before Qualtrics, for a couple of years, I worked in the healthcare space. And surprisingly, healthcare is super rich with a lot of interesting ML problems — very meaningful problems — but it’s also moving a bit slowly, for regulatory and other reasons in that space. It’s been a fertile field for a lot of causality research: in that space we often have these recommendation systems where we can apply some treatment and see the effect, and I think causality could actually be a big boost in terms of the way we should think about ML, about stochasticity and predictive systems. But in most settings — just because of the sheer complexity of obtaining treatment data — we need to work with observational data. And I know there is recent interest in making causality work with observational data, and that would be, I think, game-changing for a lot of applications. But maybe there’s not enough investment being done in that field, or it’s just fundamentally a hard problem that we need to be patient about. I don’t know, but that’s one field I’m keenly observing from the side — and waiting for, yeah. Lukas: Interesting. And I guess final question, when you think about going from an idea of a new application to deployed, working in production, what's the biggest bottleneck? Aaron: Ah, oh, that's a classical question. Lukas: So is this really a classical question, or is this just a question I ask all the time? Aaron: Classical, I'm sorry, maybe classical is not the right term. Sorry, deep question. Lukas: Deep question. Important question. I agree. Aaron: The reason is that… we know ML is… everybody’s excited about it. ML has proven its value. But is ML delivering at the scale it’s being invested in? Probably not. There are all sorts of market research reports out there showing how much ML is failing, and why it is failing. I think this boils down to that question. Most of the time, it’s going from that proof of concept to production. In my experience, depending on the setting you are in, there can be a couple of reasons contributing to that. One of them is structural, probably, and this is the most common case I’ve observed in my experience — from startups, to enterprises, to hedge funds, to other places. If you’re working with ML for a product feature — unless you’re really doing platform work — that requires a really close connection between the ML folks and the product folks. Time after time, ML folks go build models, not cognizant of the underlying production constraints and whatnot, sometimes solving a problem that’s not the one the product requires. And that’s not specific to ML: that’s a system design problem. You go design the wrong thing, or you design a system without respect to the constraints it needs to work within.
What particularly becomes problematic in ML is that if you don’t really have that structural support process in place, scientists — especially those working on, maybe not a current application but a deeper, technical problem space — they usually don’t know what having a model in production looks like, from productionalization, from latency, from input, from output, from monitoring, from a system design perspective. The way we solve it in Qualtrics is that we empower ML engineers. ML engineers— they know ML, they know they’re engineers at heart, by training, and we include them from the getgo. They’re in from conception all the way to the product launch. So they play a very critical role between how this model’s going to be used and what’s being designed and moderating that. To me, that’s the essential role machine learning engineers should play. That’s obviously a very biased opinion, because machine learning engineers or data scientists and applied scientists… I don’t think these are universal definitions. Every company goes with their own way of what’s going on. But I’ve seen that when you don’t have a person who understands both domains well and gets involved in the processes in place… and I’m not even counting all these infrastructure issues. I’ve seen places where they’re trying to do NLP in traditional microservice architecture and places like that. You don’t have the right architecture. Even if you have the right infrastructure, I think it boils down to having the right people with the right skillset and having a process, really. A clean process. So you don’t have basically everybody doing everything. That’s where things start to break down. That’s how we do it in Qualtrics. We have dedicated roles specializing in different aspects of this process, but always working together end-to-end. This is what we call “the trifecta model” — a machine learning engineer trifecta model — the machine learning engineers, the product engineers, and the ML scientists working together. Lukas: I see. Cool. Awesome. Well, thanks so much. I really appreciate your time. Aaron: Of course. Lukas: And it's fun to talk to someone who's deploying so many models in production, especially at a B2B company. You don't hear as many stories of this, so thank you very much. Aaron: Yep. Thank you, Lukas.",7417 +Jordan Fisher — Skipping the Line with Autonomous Checkout,https://www.youtube.com/watch?v=08VCMjPQRPo,3478,2022-08-04,"Jordan: Throw the data at it, get an AutoML baseline and then see if you can do better. Maybe you'll be surprised, maybe you can't, right? But your job is not to build models. Your job is to have a business impact. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Today I'm talking to Jordan Fisher, who is the CEO and co-founder of Standard AI. Standard AI is an autonomous checkout company that actually has autonomous checkout working. So this is an amazing conversation with someone who has big deep learning models really working in the real world in real conditions. It's a very informative conversation. Jordan, it's good to talk to you. It's been a long time. Jordan: Yeah. Lukas: I thought it might be good to start by saying a little bit about what Standard AI is, for folks who don't know. Jordan: Yeah. So I'll give you the quick spiel on Standard. We build computer vision powered checkout for retail. Probably a lot of people have heard of Amazon Go. 
""Go for everyone else"" is the way we think about ourselves and in particular, we're trying to make it easy to just pick it up and put it into existing stores. Jordan: It's a camera-only system, we install onto the ceilings of existing stores. And then we do all this magic behind the scenes to ultimately figure out what do people have so they can get on with their day and skip the line and get their receipt automatically. Lukas: And I guess what's the founding story? Jordan: There's two pieces to this. Piece one is just, I despise lines. Lukas: Nice. Jordan: I mean, life is short. I struggle seeing just wasted human capital, right? We spend literally billions of hours waiting in line every year. It's pretty staggering when you stack it all up at once and just look at it, that amount of human capital, just literally incinerated. Jordan: And we're doing it just for the sake of commerce, right? We're just waiting in line doing nothing. And there's another person on the other side, just waiting for us to get there, to then do this transaction. It's the most mind boggling thing. So that sucks. Life is short and we shouldn't spend it waiting in line. So I think that's the obvious piece. Jordan: But the real sort of inception of Standard was more tech-driven. We had a really cool tech team that I was working with closely at the SCC and a few folks who were sort of in the surrounding industry. This was going back six-ish years now, and we just saw this revolution happening in ML, obviously, but really computer vision at the time, I felt it was having this moment where it was just so clear that everything was about to change, that you could suddenly reach human parity almost in some of these tasks. Jordan: And, ""Wow, if that's true, if humans and machines can sort of see equally well, what does that mean for the world?"" It should change basically everything, right? Every industry where you can put a camera and see stuff should get revolutionized by this coming revolution. That was our really strong conviction. Jordan: We didn't have a particular product in mind. We just sat down and said, okay, we're going to... Actually, we weren't computer vision experts. We were just ML experts. We were building ML for the SCC. And we said, ""Okay, well, we're just going to retool."" Jordan: We sat down and for about a year, we just read every computer vision research paper that was coming out. We had one business guy, a really good old colleague of mine, who was sitting in on the brainstorming sessions and was just helping us do the market analysis to TAM and kind of figure out what was what, and we had a bunch of really dumb ideas, which I'll tell you about only over a beer sometime. Lukas: Oh come on, tell me one right now, I want to hear one. Jordan: Actually, one of them is starting to become more real in a more home setting. Jordan: I was really passionate about this idea, which was smart gyms and this idea of...because we were pretty sold already on overhead cameras being the modality that we wanted. We're like, ""Where can you just put overhead cameras and enable a cool experience? Well, gyms, right?"" Jordan: Just put the cameras up, and then you put your AirPod in your ear, and you should be able to just walk into a gym, and your synthetic personal trainer just starts talking to you. It knows what you've done in the past. It starts counting your reps. It pushes you to do one more rep or go for the 15 pound instead of the 12.5 pound. 
And it takes all the drudgery out of doing the self tracking, which no one wants to do and can also kind of push you to do more. And then help you with proper form, et cetera. Jordan: We're starting to see this now in at home gyms where there can be a camera that will help coach you, but I don't think we've seen it yet in the full gym setting. Lukas: But then you decided to do this checkout-less store idea. Was it obvious that was the best idea? How did you come to it? How did you validate it? Jordan: It was super obvious. Because when you start running the numbers, it's just wild, how big the opportunity is. And it's only gotten bigger since then. Check out on its own is huge. Like I said, it's literally billions of hours a year across the world. It's just massive. And we spend hundreds of billions of dollars to make that happen. Jordan: There's so much else that needs to happen in stores, right? That human capital has an insane number of things that we need from it in these stores. Stocking the store and customer support and refacing, et cetera. We hear that now that we're talking to retailers, they want all these things. Jordan: Of course they want autonomous checkout because they don't want a line either and they want to do the best thing for their shoppers, but they also just can't even staff their stores effectively right now. There are retailers that have 10-20,000 plus open positions right now. Jordan: Our product teams especially go and interview not just the retailers, but employees in the store. And they get super excited, as much as anyone else about autonomous checkout. Because they're like, ""Oh, if we have this in my store, I get to go do all the things that I want to do,"" right? I get to go interact with customers. Jordan: I have my locals, my regulars, that I really like talking to. I can finally have a spare minute to go fix the out-of-stocks that are actually hurting the bottom line of the store, because someone walks in and wants their Snickers bar and we haven't had a chance to restock it. That's a massive hit to retail. Jordan: So really, everyone's super excited about it. All these industries — these retail tech industries — can be done better with computer vision powering them, right? It's inventory and out of stock and loss prevention and insights and analytics. And then it's also checkout, right? Jordan: We started pulling back all the layers of this onion. We're like, this is really going to just change the entire 25 trillion dollar physical retail industry. Every aspect of it gets better once you have a smart system in the store. So we were just like, ""This is insane."" Jordan: That was one of our metrics, obviously. A huge TAM. Another one of our metrics was we wanted it to be a really hard tech problem. Just from a personal satisfaction place, we love working on hard tech. But then my personal preference is working in super rarefied industries where there's a huge barrier to entry from a technical challenge perspective, because it rarefies there, right? There's only a handful of teams that are going to be competing with you. Jordan: It was kind of that sweet spot of, ""This is really hard, but it's not quite as hard as autonomous vehicles, where we're going to be bashing our head against this for a decade and probably need to go raise a trillion dollars to compete with Waymo."" It hit all that, hit the sweet spot of all those things. Lukas: Is the challenge to actually see when someone takes something off the shelf? 
Is that the challenge, or is there a point where you sort of show it to a camera and then check out? How does the experience work? Jordan: Yeah. From the experience perspective, we're really trying to make it just completely seamless. You forget that you're doing this, that you're shopping. The goal is to make it feel like it's your personal pantry. Jordan: Just walk in, grab stuff, put it in your jacket, put it in your pocket, put it in your purse. You no longer think about transacting. We're hoping that it sort of does to shopping what Uber and Lyft did to taxis. You're still transacting, but you don't think about that transaction moment anymore. You're just hitting a button. A car shows up, you get in, you get out. Jordan: It's just so seamless that now you take a Lyft more than you take a taxi, right? You're growing the pie and that's what we're really hoping to do with retail. Lukas: What do you do on day one when you decide to make a company that does this? What was the next step? Did you go talk to stores and try to get them to let you install cameras and run ML models? How does it work? Jordan: We did. Yeah, we did actually. Jordan: We actually had a pretty big co-founding team, but Michael — who's our chief business officer, our ""business guy"" — me and Michael were in New York at the time and we hadn't quit our jobs yet, so we didn't have enough conviction, but that came shortly thereafter. Jordan: But yeah, we just walked around Williamsburg in Brooklyn and just started talking to store owners. It was Saturday...the very first thing we learned about retail was you don't bother retailers on a Saturday, because that's the most important day of the week for them to sell stuff. We were just going in talking to store managers and they're like, ""This sounds cool, I guess, but you need to get out of my store right now. I got stuff to sell."" Jordan: So we started coming back on Mondays and Tuesdays. I mean, you go talk to retailers anywhere — even five years ago, six years ago — and it was already clear that everyone wanted this, right? We were super lucky. Once our name just got out, once we incorporated and put out even just a little bit of video of what we were doing, we just got insane inbound from basically all of the retailers in the world. Jordan: It was everything from small mom-and-pops all the way up to mega Fortune 10 companies. That's how we knew that if we could build this — at the time, it was not clear that we'd succeed — then yes, there is this ridiculous, amazing demand at the end of the rainbow. Lukas: And so where are you at? Can I go to a Standard AI store and pull stuff off the shelves? Jordan: Yeah, yeah, for sure. I mean, it is hard tech, right? We're five years in and we haven't deployed everywhere yet, but I like to call this space ""AV light"". You probably have a lot of AV folks on your show, so I'm going to be incessantly pitching them. Everyone in AV should come over to autonomous checkout, because the time is now. But we get to go to market faster. Jordan: Actually our tech is probably not as advanced as AV. I think we're pretty sophisticated, we do some really cool stuff, but we haven't invested quite as much as the Waymos of the world. But we get to go to market faster because we can make a mistake, and if so someone gets ketchup for free. Jordan: It's actually an okay experience for someone, and retailers are used to it as well.
They have a built-in margin that they expect to lose because there's loss and theft and mistakes and breakage, et cetera. So it's just a really more friendly place to be. Jordan: We're just now kind of exiting MVP stage. We're at 10 stores now, that we've launched in with real retailers. They're just regular stores that we showed up and installed our cameras and transformed them. Jordan: One here in the Bay Area, actually at San Jose State University.We just launched about two months ago, which was super awesome. Because we have more adoption from that one store than our other 10 combined, because the students, they're like ""Yes, early adopters, great."" We have 500 people using our system every single day just at that one store, which is super exciting. Jordan: It's still early, but at the same time we're kind of exiting MVP and really starting to ramp up with our retail clients. They're finally seeing the tech work in their own stores or their competitor stores. And they're getting really excited about how much shoppers love this and also all these other value props that we have been pitching for the last five years around inventory, et cetera. They're finally seeing it now and we're growing into that and starting to expand. So that's really exciting. Lukas: In those five years that you've been working on it, what's unlocked the ability to put it into stores? Has it really been kind of making the models more accurate or something else? Jordan: It's a whole slew of things for sure, right? There's product work for sure. Because at the end of the day there is real experience and the way that you're presenting it to people in the store matters. Jordan: There was also just some go-to-market aspect of it too, right? Where when we started, we were like, ""We're just going to put this in every store in the world,"" which is our intent, but we were like, ""Let's go sign deals with everyone."" Jordan: We were going out and talking to 500,000 square foot stores, mega grocery stores. And then we had to kind of take a step back and say, ""Well, look, this is cutting edge tech. We need to start a little bit smaller. We can still partner with big companies and mom-and-pops, but let's go after convenience stores to start with."" Jordan: It's a smaller footprint. You don't need as many cameras, et cetera, smaller number of items. There was a little bit of kind of a reality check there that we should start a little bit smaller, which we did. So now we've kind of mastered convenience stores and are going to be expanding from there. Jordan: But yeah, for sure, the engineering, the machine learning, the operations. I think, for me, operations are always a super under-appreciated aspect of ML. You just got to go heavy on ops, and care a lot about your data, care a lot about your labels and your quality. That's been super important for us. Lukas: Interesting. So when you say ops, you mean labeling? That's the primary ops component? Jordan: Yeah. Tons of labeling for sure. I mean we definitely have big datasets, and we have a little bit of a HITL (human-in-the-loop) too, as part of our live system. Which is another kind of...I guess you see HITL on some AV systems as well, where there can be a disengagement and then a remote pilot will take over. Jordan: But for us that's actually a much easier part of the process, because you're not driving a car, so you don't need this 10-millisecond response time. We just need to get someone a receipt in the next 5 to 10 minutes. 
So if we kick off a background thread and have a human take a look at something, that's totally fine. And then that's another label that we can throw back into the system. So it's sort of all self-feeding. Lukas: When you set up a system in a new store, do you have to train it on the particular inventory in that store? Or I guess even inventory can change over time. Do you keep retraining your systems? Jordan: Yeah, for sure. I mean even the stuff that's not...so the items definitely change, the SKU set and the catalog change, but even the stuff that you would hope would be more generic is not quite as generic as you want. Jordan: Our people detection and people tracking systems are, in theory, fully generic. We show up at a store, we install cameras, we flip a switch, and we basically have a multi-view tracking system that can fully anonymously track 20, 30, 40 people within a space in real time, which is super, super cool. Jordan: But nonetheless, it does get better if you fine-tune on that store, right? So we'll go in, and over the course of the first month or two we'll label a little bit of data, fine-tune the model, and then redeploy to that store. And you do get a boost by doing that. I think at some point you probably start seeing diminishing returns. We're only at 10 stores, but presumably at 1,000 stores or 10,000 stores, that human model is going to be so general that there's probably no point in fine-tuning on a per store environment. Jordan: But when it comes to products, like you said, there could be a different product in every single store. That plateaus too. To give you some rough numbers, a C-store can have maybe 5,000 unique SKUs in their store and they're going to have maybe 30,000 unique SKUs across their fleet. But that fleet might be 1,000 stores. You get a pretty good economy of scales once you start getting to fleet-scale, because to go from one store to the full fleet, you're only going to 6x the size of your catalog, but you're going to 1000x the size of your deployments. Jordan: It pays off in the long run, but it's super expensive when you're only in 10 stores like we are. So we work really hard to stay on top of those ops of the churn of new SKUs that are showing up — it's the Easter version of the Snickers bar — it's just constantly churning for sure. Lukas: What do you do in that time when it's training, that month or two where people are coming in? Is it all sort of human-operated at that point, and then it gradually cedes to the ML algorithms? Or do people have some other mechanism for paying? Jordan: What's cool is we run in the background because we're showing up at existing stores. We're not building a new store, right? So the same store is there. We're not getting rid of the existing point-of-sale system, the existing checkout system. Jordan: We install our cameras, we're doing our things behind the scene, and then the store's just running as is. It's only when we're ready — and we've showed the retailer that we've reached a certain accuracy — that we flip the switch on. But then even then, once we flip the switch, it's not a hard crossover. Jordan: Which actually is really nice for us, because there's still people...Apple Pay has 6% adoption or something right now. You're not going to see an overnight, 100% adoption of Standard. Although at San Jose State, we do see that, we're basically at 100% adoption at that store. But- Lukas: -that's awesome. Jordan: -in most stores, you're not going to see 100% overnight. You're going to see like 5%, right? 
To start with, and it'll take time for everyone to switch over. Jordan: But what's cool about that is you get the point-of-sales signal. The point-of-sales system will tell you what the non-Standard shoppers are buying. Our system can still predict what we think they're buying. And then that's actually a nice corrective signal where we can say, ""Well, where are we making mistakes?"" Jordan: We have a team that will do deep dive analysis to sort of suss out what happened and then ultimately see if that needs to be a training label back into the system. That's a really nice flywheel because that's just running before and after we launch. Lukas: Totally, totally. Has your views on computer vision architecture changed since 2017? I feel like computer vision is constantly making advances. Does that affect you? Have you changed at all the way you've thought about training your models? Jordan: Yeah. We're still doing a lot of just old-fashioned — at this point — supervised learning. When we started Standard, I had a rule. At the time I was much more involved with the ML team, and had a rule with the ML team, which was, you're only allowed to do supervised learning. Jordan: Even five years ago, there was all this fancy stuff, right? Autoencoders and whatever, blah, blah, blah. It was all...I don't want to say BS, it was good research. But it was not production quality, industrial machine learning yet. But it was super attractive, people wanted to play around with those things. Jordan: That was my rule. ""It has to be just old-fashioned supervised learning. We're just going to throw a bunch of data at this thing. I'm sorry that's not glamorous. It's still going to be really hard, I promise you. You're going to have plenty of chances to solve hard problems."" And we did, we solved some really cool stuff. But that was kind of the rule back then. Jordan: I kept that rule for a long time and I think it's just now getting to a point where I think there's different ways. Of course supervised is still the mainstay, but I think synthetic data is getting super interesting. I think also just in the last 6 to 8 months, this self-supervised revolution that's happening in vision — that had already happened in NLP — is super fascinating. Jordan: We're starting to play around a little bit. It's not in our production models yet, but we're starting to play around with it a little bit. It's pretty wild what some of the stuff can do. Jordan: Actually, I had COVID about a month ago, so I had a few days where no one was letting me ""work."" So I was just programming instead. I was like, ""I'm just going to play around with some of this self-supervised stuff."" I took all of our images from all of our shelves from production, with no labels whatsoever. It was hundreds and hundreds of gigabytes of just images of products. Jordan: I trained one of these massive Vision Transformer Masked Autoencoders. I just let it run while I had COVID because it's about a week to recover, so I just let it train the whole time. Jordan: The things that were super striking about this was, first of all, it took me like four hours to do this. Shout out to Hugging Face and all these...5 years ago, even if I knew what the model architecture should have been 5 years ago, I would've spent a month programming this thing. Lukas: Totally. Jordan: Here it is, a couple hours punching around GitHub, tuning up a little bit of stuff, I spin up an instance on Google, and then I just let this thing run. Lukas: Wow. 
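As a rough illustration of the experiment Jordan describes, the sketch below sets up masked-autoencoder pre-training on unlabeled product images with the Hugging Face transformers library. The dataset path, image size, 0.9 mask ratio, and training settings are assumptions made for the example, not Standard AI's actual run.

```python
# Rough sketch of self-supervised pre-training on unlabeled shelf images
# with a ViT masked autoencoder. Paths, mask ratio, and hyperparameters
# are illustrative assumptions, not Standard AI's actual configuration.
import glob
import torch
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from transformers import ViTMAEConfig, ViTMAEForPreTraining

class ShelfImages(Dataset):
    # Unlabeled product/shelf crops; no annotations are needed at all.
    def __init__(self, pattern='shelf_crops/*.jpg'):  # hypothetical path
        self.files = glob.glob(pattern)
        self.tf = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        return self.tf(Image.open(self.files[i]).convert('RGB'))

# A higher-than-default mask ratio, in the spirit of masking out more
# than the 75% used in the paper because packaging is so visually regular.
config = ViTMAEConfig(mask_ratio=0.9)
model = ViTMAEForPreTraining(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4)

loader = DataLoader(ShelfImages(), batch_size=64, shuffle=True, num_workers=4)
model.train()
for epoch in range(10):
    for pixel_values in loader:
        # The model masks patches itself and computes the reconstruction
        # loss on the masked patches only, so no labels are passed in.
        loss = model(pixel_values=pixel_values).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

The pre-trained encoder can then be reused to initialize a supervised item classifier, which is the accuracy bump Jordan mentions later in the conversation.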
You just ran on one instance, it wasn't even distributed? Jordan: Yeah, I got the biggest instance I could. It was a 16-A100 instance. Which is something that only I get to rent. Hope no one from our company is listening and is like, ""Oh, that means I get to go rent 16 A100 GPUs."" Jordan: But yeah, you didn't even need to...I think the Vision Transformers are still not as big as the NLP Transformers, right? You don't need the...was it the PaLM model from Google, where they had two v4 TPU super pods? Like 5,000 TPUs or something, right? Lukas: Right. Jordan: For who knows how long. There are no vision models that are even close to that big right now, but maybe we should be starting to do that stuff. I don't know. Jordan: But anyway, just to wrap this, I know I'm going on super long. I trained this masked autoencoder and it's basically perfect. It's insane. You can mask out 95% of an image. In the paper they talk about doing 75% masking, but there's just such a clear signal from products — because CPG products have just such clear packaging — that you can mask out even more of the image and it'll reproduce super faithfully basically the whole package. Because it's able to learn what the packaging should look like, right? Jordan: If you think about it, we always talk about the manifold hypothesis, where images sit on some sub-manifold, which is maybe or maybe not true, but it's definitely true for CPG. Because if you have a CPG product, the manifold is six degrees of freedom, right? It's rotation and translation with a little bit of lighting. But that's it, it's literally a very low-dimensional manifold. The model's just able to learn that on its own and it just completely faithfully replicates these. For me, it's just wild. Lukas: I've not heard of the manifold hypothesis. Could you describe that? It seems like it would be more than six degrees of freedom for a packaged item. Jordan: Well, for other stuff it's way more. Actually, you can see it as it shows up...sorry, I've been eating snacks here. When you see how well it does on things like chips, it still does really well. But a bag of chips has more than six degrees of freedom, right? Because it's not just rotation and translation in three-space, which is six degrees of freedom. It's got all of this — sorry for the noise — all the crumpling, right? Jordan: There's actually a lot more degrees of freedom. But for something that's rigid, rigid body motion says it's just six degrees of freedom. X, Y, Z, and then yaw, roll, pitch. Is that what it is? And that's it, right? Relative to a fixed camera, there's only six degrees of freedom — if you take out the lighting aspect of it, which adds some additional degrees of freedom. Jordan: But yeah, the manifold hypothesis is that real, natural images live on these much lower-dimensional sub-manifolds — a human has many more manifold dimensions to it — but these CPG packages are six dimensions. So you can learn it pretty quickly, apparently. Lukas: That's amazing that you're able to spend time training your own model. I'm jealous. Jordan: I'm jealous too, because it doesn't happen very often. Lukas: I'm curious. You guys have been Weights & Biases customers since the very early days, and I'm not here to advertise Weights & Biases, but I would love to know more about your stack. It sounds like you're playing with Hugging Face. Are you using that in production? What other tools are kind of important to you to make the whole system work? Jordan: Yeah, yeah, for sure.
I'm super fascinated by this question and the new word ""MLOps"". I don't know when it showed up on the scene, but now everyone talks about MLOps. And we have the holy religious war around whether or not...I think it's very similar to the process that DevOps went through, where DevOps started as a methodology. It was sort of a practice and then it very quickly transformed into a role, right? Jordan: Where it was like, once you can enumerate the things that are in practice, then certain engineers don't want to do it anymore. So you want another engineer to do it for you and you give them that title. You're a DevOps engineer. That was against the whole purpose of what DevOps was about. But then same with MLOps, right? MLOps came around and it was like, this is a practice for how ML engineers should be doing their own day-to-day ML development, right? Lukas: Right. Jordan: For me, we call this end-to-end, full-cycle machine learning at Standard, which is how we tend to run things. You're responsible for thinking about your business impact, which starts with thinking about the metric that you care about. Jordan: I'm a big — I don't know what word I want to use — proponent of thinking about metrics. The easiest thing in the world is to look at a research paper and be like, ""Oh, so I'm going to use this mean average precision or whatever. That's what all the researchers are doing,"" but it's like, ""No, stop."" Jordan: The first thing you need to do is spend a couple weeks just thinking about your metric, because we're in production. We have real world use cases. And I guarantee you that the researcher that came up with mean average precision had no real use cases in mind. They just came up with it because they needed the number, and it is definitely not the thing you want to optimize for, right? Jordan: You need to think really hard about what your metric is, and validate that that is the right metric. For me, full-cycle ML is, ""Think about your business impact, your metric, your data. Get hands on with your data, get hands on with labeling, get hands on with model training and get hands on with deployment monitoring,"" and then what we call ""closing the loop"". Jordan: You need to have those tools that will meet you at the end and say, ""Actually your journey has just begun. Let's see how things are failing in production. Let's make sure that we're taking those as hard examples to bring back to the flock."" Lukas: Right, totally. Jordan: That's super exciting, because it's a whole discipline, but it's also exciting because it's still wild west in terms of ""What does the full stack need to look like?"" Weights & Biases is super cool. We're on GCP, so we're big Google users. I think they're innovating a lot in terms of what their AI stack looks like. Lukas: Oh, you use their AI stack? What's your favorite stuff? Jordan: I've never used any of it, I just know we use it. We use Vertex and...that's not true actually. I'm a big believer too in — sorry I say this a lot — AutoML where it's...I personally have definitely played around a lot with Google's AutoML. For me, it's another one of those places where, as an ML practitioner, you don't think to go to AutoML first. Jordan: You're like, ""AutoML was built for an old-fashioned engineer or someone with a business problem. They don't know how to do ML. So Google built this thing to make it easier for them to dip their feet."" It was like, ""No, no, no, no, no. 
Take a step back, first of all."" First of all, I'm sure whoever built AutoML put a thousand times more resources into this than then you're going to put into your custom ML model. Second of all, even if it's not better, it's a great baseline, right? Lukas: Totally. Jordan: Just do it, throw the data at it, get an AutoML baseline and then see if you can do better. Maybe you'll be surprised, maybe you can't, right? But your job is not to build models. Your job is to have a business impact. And if you can do that faster with AutoML or any other tools, just go for it. It's right there. Sorry, I'm preaching to some choir out there. Lukas: No, no. It's funny. We had Anthony Goldbloom on the podcast — the CEO and founder of Kaggle — and he was saying that he used Google's AutoML and it got him in the top 10th percentile on a Kaggle competition. Which I thought was amazing, it's like, ""Come on guys, use AutoML"". Jordan: Yeah. I mean, that's what these tools are for, right? Jordan: I think that's cool actually, because it's...again, for me it hearkens back to the previous wild west that we had in engineering, right? Where we used to write assembly code. That's what we all did. Not me personally but that's what we did back in the day, right? And then we started developing compilers and blah, blah, blah, and starting to move up the stack. Jordan: You had the same thing happen back then, where people were like, ""No, no, no. You can't use a compiler. You're never going to be able to write assembly the way that I can write assembly."" And sure enough compilers got way better than people and we kept moving up the stack of abstraction. Jordan: I think the same thing's going to happen with ML. We're not going to be sitting here tuning, manually writing, ""Layer Six goes into Layer Seven, and it's going to go from 128 features to 256 features."" That's not our future, I think, as MLEs. It's definitely many levels above that in abstraction. Lukas: You're one of the people that has big deep learning models as a really core part of your business and you're successfully deploying a lot of them and continuously improving them. So I'm sure people are going to be interested in more specifics around your stack. Could you share if you have a point of view on frameworks, do you use it all? Give me the stuff you like and don't like, I think that would be the most valuable thing you could offer our audience. Jordan: For sure. I pride myself throughout my career — even pre-ML — on picking the right horses, the right stacks that end up...even early, they end up playing out. Lukas: All right. So tell me about your 2017 stack then, because I know Weights & Biases is in there. You were one of our first customers. Jordan: So, that was great, but for sure the thing we didn't pick correctly...I picked TensorFlow at the time, and I think the whole world has revolted against TensorFlow. I think that the challenge is, you pick the wrong tech and then it gets steeped in your stack, right? It gets really hard to pull it out. We've switched over to PyTorch since then. Lukas: Okay. Wait, let's talk about that. Because everyone's got a different take on this. Why do you think PyTorch beat out TensorFlow? What do you think it was? Jordan: I mean, for ease of use is...it's the dev experience, all day. I don't necessarily think that it's technically superior. And I think Google's got a great contender now, I've been playing around with JAX in my spare time. 
We don't have anything in production in JAX, but we have a few irons in the fire, a few things that we're looking at and- Lukas: -nice. You're going to get a couple resumes from our community, I guarantee it. Jordan: JAX has got the same great dev experience. The ecosystem's a little bit more nascent, but that's to be expected, right? And I think it's more technically excellent. I think it's got way more head room to grow. And hopefully it's not going to be as painful of a shift to go from PyTorch to JAX. We'll see. Jordan: I think, as the deployment stories are maturing too, we're getting to this place where it doesn't really matter the way you train your model and the way you're iterating on your experimentation. The way you're going to production can be decoupled from that. So as long as you have the weights, then you can take it to production in a different way, potentially. Lukas: What about CI/CD, production monitoring? Do you use any of the stuff out there? Is it home-grown? How do you think about that? Jordan: That is a little past my...I'm not as in the weeds anymore, so I can't answer too much of that. I know we're always complaining about it internally, so I know we haven't settled on the right thing. Lukas: The 2017 stack though, so it's TensorFlow. What else? Jordan: We definitely built a company on Python to start. It was just a pragmatic choice, because ML is Python, unfortunately. I despise Python. Lukas: You despise Python. Wow. That's strong. Jordan: You got to come out of the gate strong with these. Lukas: I love it. Yeah. Tell me more. What do you like? Jordan: I mean, it's great as an...if I'm going to write a 50-line script, it's fantastic. If I'm going to write a 50,000-line script, and it's going to be sitting across 20 engineers, it's a disaster and- Lukas: -what do you want? What would you prefer? Jordan: This is the normal religious war, so I'll just harp on the same points, right? But for me, I like strong types because I think that they're...it's not even that you get faster speed — which you do, that's great — but it's a people contract. It's not a machine contract, it's a people contract and it enforces it. Jordan: So I know when I come to this piece of code — whether I wrote it 20 years ago or someone else wrote it yesterday — I know exactly what the output is, what the input needs to be. It's a contract. Whereas, with Python you still have contracts. You still spend a lot of engineering resources coming up with the right problem decoupling and ""Where should our API boundaries be?"" But it's never enforced. Jordan: So, is the contract that we all agreed upon actually what's happening or isn't it? And you just have to trust, or build like a ton of unit tests. But there's no guarantees. So for me, strong typing's all about building trust and being able to communicate better with other people. Because for me engineering, it's a team sport. It's a people sport, actually. It's very much collaborative and that's why I like strong types. Lukas: So favorite typed language is what? Jordan: This actually became part of our stack evolution, was we picked Rust actually. Lukas: Oh, you're going to get a lot of resumes. Okay. I was going to guess that. Jordan: We're definitely hiring plenty of Rust engineers right now. So if you like ML, and you like productionizing ML, and you like Rust, and you like streaming a lot of data, you definitely want to come to Standard. Jordan: We're still not 100% on Rust. 
We have some other stuff in our stack too, but we have a good healthy amount of Rust. Actually one of our early wins was — and this is why we ended up choosing Rust — was...one of our founders was a huge Rust proselytizer. Lukas: Wow. In 2017? Jordan: Yeah. Even years before that. This was Brandon, one of our co-founders. We were working together for years before that, and he was always pitching me on Rust. He's like, ""Jordan, Let's use Rust."" And my job was to say no. My job as engineering manager is to say no. Lukas: It's a funny story. I called the Streamlit founder — who was also on the podcast — after he sold his company for $800 million, I was like, ""Dude, what are you going to do?"" And he's like, ""Oh, I just want to write more Rust code. I feel like I now can finally do it, so."" You should send him your job opening. Jordan: That's cool. Well, if he doesn't have enough money and he just wants to come write some Rust code with us. Jordan: But yeah, it's a cool language, right? It took Brandon a while to...because he kept telling me what it was good at and what it wasn't good at. And I was like, ""Okay, using what you're telling me, the problem we're working on right now is not what it's for. So you just want to work on this because it's cool."" Jordan: But then we finally had this problem where it was the right thing. This was our multi-view tracking algorithm, which is this...we run deep learning models per camera to extract these features. And then you have to merge them together across all the different camera feeds in order to build up a single cohesive understanding of how people are moving through the entire store. Jordan: That part is not deep learning. It's just this super gnarly graph theory, combinatoric optimization problem. It's dealing with a ton of data, right? Doing a lot of heuristics and it has to be super fast. It has to be soft real-time because it's stream processing, maybe 100 cameras each at 30 FPSs. Jordan: We were building that algorithm and it was just getting wrecked. And then we were investing so much engineering resources into parallelizing the Python code. And that's when you have to take a step back from Python. When you start fighting the GIL — the global interpreter lock — and you're doing all this funky magic to get around that, and you start introducing the worst possible technical debt to paper over this fundamental limitation of Python, then it's really time to take a step back. Jordan: So we evaluated Rust. We're like, ""Let's see if we can rewrite this whole..."" And it was a big algorithm. ML's awesome because you write like 50 lines of ML and you get magic. This algorithm's like 10,000 lines of gross, nasty, massive heuristics. But we sat down and rewrote it in Rust pretty fast. And we got a 50x speed up. Lukas: Wow. Jordan: We've since then gotten like additional 2- or 3x speed up, because...the whole cool thing about Rust is fearless parallelism. I don't know if you know that expression for Rust. It's not just strongly typed. It's hella strongly typed. It has this ability to identify race conditions and make sure that when you're doing parallel programming, you're not going to shoot yourself in the foot. So you can more confidently move into multi-threading. Jordan: We've just gotten huge benefits by moving over to Rust, from a speed perspective and from a confidence perspective too. You have this super complex algorithm, you change something, and you want to push it to production. 
Jordan: The thing that we would've had to rewrite it in otherwise would've been C++. It scared the hell out of me, because C and C++ are just...I've had to deal with production code in those languages in the past. Memory leaks and segfaults...and you're having ML engineers write these algorithms, and they're not necessarily experts at memory management, right? Jordan: What's cool is now we have more research-oriented people that can make tweaks to this multi-view tracking code, and we don't have memory leaks in production. We don't have segfaults in production. They confidently make changes to the algorithm, push it to production, and it works. That's super cool. So I'm definitely a Rust proponent. Lukas: Any other early technical choices that you really feel proud of? Jordan: That's a good question. Jordan: I'll tell you one choice we made that was totally wrong. Which was, I had this belief at the time that ended up being correct, but it was still the wrong choice. The belief was that raw camera footage is better than decompressed video. Jordan: If you have a camera feed, the best thing you can possibly do is record those pixels raw to disk, train ML models off those raw pixels, and then deploy your model. Never allow H.264, H.265 compression to sit in between, because it's obviously throwing away information. That's its job, right? It's tuned for human fidelity, so that we can't tell the difference. But I was fanatical that we had to use raw only. Jordan: All this engineering work went into just being able to store all this raw data all the time. And it just got way too slow to maintain that engineering work. We finally ended up doing an experiment where we collected a bunch of video data, and we labeled it both from video and from raw, and then we trained the models. Jordan: Sure enough, there was a pretty sizeable gap between how much accuracy you could get — I don't remember exactly what it was, but it was meaningful — but we sat down and we were just like, ""It doesn't make sense. Sure, that accuracy matters, but we won't have a company if we don't move up to video."" Jordan: I think the scary thing is, to this day, we still have a little bit of vestiges of working off of images instead of video. You make these decisions early on, and they're weeds that are so hard to pull out. We're all still paying for my sins. Jordan: That's the dangers of making some of those bets. But I think we've made good bets as well, in the past. Lukas: Where do you store all your data and how do you retrieve it when you want to train on it? Jordan: We started off as on-prem stack. We bought GPUs, and we built machines, and we put them into convenience stores. That was wild, because we had to upgrade the HVAC in order to make sure the convenience store didn't melt down. Jordan: Now we run everything in the cloud. We stream everything to the cloud, which is great for iterations. You just have access to whatever you need. I mean, we have retention policies, et cetera, obviously. But I think moving forward — probably in the next year or two — I suspect we'll be taking some of it back on-prem. Jordan: Mostly just from a cost calculus perspective. Because the cloud's great — it's super flexible — but it's not necessarily the most economical. Especially when you're talking about renting GPUs, which is still an arm and a leg up on the cloud. Lukas: Do you worry at all that the problem you're solving as ML gets better and better might get too easy and would no longer be a deep technical problem? 
Jordan: I don't worry about it. I know that it's going to happen. Jordan: We talked about this even in the early days of Standard. Back then we said 10 years from now, it's going to be a ""git clone"" to do autonomous checkout. Or worst case, a four-hour project, right? An undergrad's going to be doing it over the weekend or something. Jordan: We knew that was going to happen. And I think we're seeing the progress too, right? Even the story I told you about the masked autoencoders, that's such a...and there's real applications to those too. You use that as a pre-training step, and you get better accuracy on item classification. Better than purely supervised, right? It's just this crazy bump in your ability, and it took us a couple hours to do it now that it's just a ""git clone"". Jordan: So it's definitely happening. I think we still have a few years left before it's a ""git clone"". I would still guess like four years maybe, four or five years. Lukas: Just four years, wow. Jordan: Four or five years. Things are moving fast, man. It's crazy. Lukas: Wow. That's crazy. Jordan: But what we told ourselves five years ago was, ""Yes, that's going to happen, but the same is true for any industry."" A point-of-sale system is...I like to talk about this a lot too. A barcode scanner hooked up to a point-of-sale system was literally state-of-the-art physics 50 or 60 years ago, right? Jordan: It's a laser. We didn't even know lasers were physically possible. And then we hypothesized the physics, we validated the physics, we productionized the technology, and now it's so ruggedized that it's in every single store in the world, right? And you don't even think about it as technology. Jordan: So yeah, that's going to happen. That's okay, I'm sure we'll have other cool hard problems to solve in 10 years. But what we need to do is transition this tech lead that we have into a sort of a true moat, a true flywheel. Jordan: And I think that's making ourselves indispensable to retailers, just providing them so much value that...sure someone else could come along and ""git clone"" autonomous checkout, but our customer support is amazing. Our product is super refined. The experience is amazing. We've got 30 other amazing features that sit on top of the stack that's invaluable to the retailer. The shopper has come to depend on this because it's Standard in their pocket and they expect to walk into a store and just have Standard work for them. Jordan: I think you have to use this tech advantage to turn it into the normal types of advantages that regular startups are using to build a moat. That's okay. I think that happens to every hard tech company. Lukas: Or they don't make what they set out to make. I think that might be a common failure, but- Jordan: -yeah. For sure, for sure. Lukas: What's one non-obvious thing that you could do to enhance the experience? I'm sure you've thought about this a lot. Jordan: There is still this friction in the experience, which is...our visual system is fully anonymous. We don't know who you are. You're Person 17 when you walk into the store, and that's intentional. We don't do facial recognition, et cetera. But we have to tie your payment information to Person 17 somehow. Jordan: If you've been to Amazon Go, they do these gates. When you walk up to the store, to get in you have to pass through a gate literally, and you use the Amazon app to open the gate. There's a visual sync, basically, where behind the scenes, Amazon's saying, ""Okay, Susan just badged in.
We see Person 17 at the gate. So Person 17 must be Susan."" You do that, we call it association. Jordan: We do something very similar. We don't do gates, because we believe gates are antithetical to good retail. You don't put friction at the beginning, you put friction at the end. Amazon knows that too, so I don't know why they...they're the best e-commerce player in the world, they know that you put friction at the end. You never put it at the beginning. Sorry, just a tirade. Jordan: We're strong believers that you don't put gates up. What we do is we put NFC stickers in the store. What you do is the same thing. Anytime during your trip — you don't need to do it to get into the store — you can just come shop. Anytime during your trip, you take your phone out, you bump one of these NFC stickers. And then we do the same thing, where we know when the bump happens on the backend; and we know that that's Susan; and then we know Person 17 was the one bumping because we have this fine-grained 3D reconstruction of Susan as Person 17, so we know where their hand is; and then we do the association. Jordan: But there's still that friction, right? You have to take your phone out of your pocket and think about transacting. I have this belief that we'll be able to get rid of that at some point in the next couple years, where — without being privacy invasive — you can keep your phone in your pocket and, using additional signals like Bluetooth, we should be able to narrow in and figure out that Person 17 is Susan. Because then you can really just walk into a store, and walk out, and never have to think about transacting. Lukas: That's cool. One thing I want to make sure I ask you about is your StandardSim data set. Could you maybe describe what that is and why you released a public dataset? Jordan: Yeah. This was super cool. It's a 3D sim, basically, of stores. It builds 3D...not 3D reconstructions, but it builds 3D models of stores totally synthetically. Where are the shelves, where are the cameras, where are the products in the shelves; it tries to simulate the way that the products are stocked in the shelves. It has a decent corpus of SKUs, et cetera. Jordan: It's just a way to build up these 3D representations of stores. Obviously what's cool about that is you can generate infinite image data of stores, synthetic stores. That's a huge leg up to build, to move quickly, and get off the ground, and start training. Jordan: We often see a lot of models where if you train on synthetic, it doesn't give you as good results as if you train on real data. That's definitely true still for some of our models, but there are some models where you just can't get the data labels in particular, in the real world, right? Or you can, but it's just insanely expensive. Segmentation is a good example where it's just so expensive to do segmentation. Jordan: For us, actually, we were working on this model called change detection. Which is, if you look at a shelf over time, you can see the item sort of be taken and removed. That's a really interesting...we can create that dataset, the real data set, but how do you label it? Asking a human to look at a before and after image and draw a segmentation mask of where the item was staged, it's not an easy thing to do. Jordan: But with the synthetic data, you can just simulate it and get a billion images of before and after with perfect segmentation masks. So that was the original inspiration for creating that data set.
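To make the labeling point concrete, here is a minimal, illustrative sketch (not StandardSim's actual format or pipeline) of why synthetic renders make change-detection labels nearly free: when the renderer gives you exact per-pixel instance IDs for the before and after frames, the ground-truth change mask is a single comparison rather than a human drawing a segmentation mask.

```python
import numpy as np

def change_mask(instances_before: np.ndarray, instances_after: np.ndarray) -> np.ndarray:
    """Pixels whose instance ID differs between the two renders, i.e. where an
    item was taken, added, or moved on the shelf."""
    return instances_before != instances_after

# Toy 4x4 "renders" of per-pixel instance IDs (0 = empty shelf, 7 = one product).
before = np.array([[0, 7, 7, 0],
                   [0, 7, 7, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]])
after = np.zeros_like(before)  # the product was picked up

# A perfect change-detection label, no human annotator involved.
print(change_mask(before, after).astype(int))
```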
And then I think we're all big proponents of open source and I think open source data is sort of the next version of that. If ML's going to revolutionize the world — which it is — we have to make that more democratic. The code is becoming super democratic, the data is not, right? Jordan: I think that's sort of an interesting gap. I'm not exactly sure how to fully close that gap, but I think that open sourcing this synthetic data set at the very least is a cool way to help. Lukas: Interesting. But this seems very core to your business. You weren't worried that a competitor might use this dataset to build a competitive algorithm? Jordan: My opinion is there's some great teams working on checkout. Obviously, I think highly of Standard, but there's a couple other great teams out there. Jordan: They're doing this their own way. It would be sort of like Cruise using a synthetic generator from Waymo or vice versa. It could happen, sure. If they do, great, best of luck to you. But I assume they've got their own stuff that they're doing. And switching costs are so high that they're so deeply invested in whatever synthetic thing they've got, or XYZ that they're doing, it's just going to be too expensive for them to switch. Jordan: I think really the value of these open source initiatives is for the broader community, so that people can get their hands on this, play around with it, and come up with some other really cool application. And show us what's possible, right? We're so tunnel visioned on trying to build this one thing that we're trying to get out. Maybe there's some other cool stuff that you can do with this. Lukas: Have you seen any interesting applications yet? Jordan: Not yet, but hopefully someone who wants to come do Rust machine learning will ""git clone"". Lukas: Don't forget about JAX. Jordan: JAX, yeah. ""git clone"", do something cool with it, and then start. Lukas: And Weights & Biases, don't forget. Jordan: Yes, yes, exactly. Lukas: Do you have any kind of benchmark for accuracy on this dataset? Do you think about it like that at all? Jordan: We do, yeah. I mean, it's going to be similar...I don't know exactly what it is, we can follow up with the folks that built that dataset, but it's something more similar like intersection over union, right? It's, ""How close are you getting that segmentation mask, basically, to the ground truth?"" Lukas: Right, right. Lukas: All right. Well, we always end with two questions that I want to make sure I ask you. One question is, what's an underrated topic anywhere in ML that, if you had extra time, you'd love to look into or study? Jordan: We touched on a lot of them. I guess some of them are underrated, right? Jordan: I'm a huge believer in tooling. You got to pick the right tools and you got to keep pushing the tools forward. Huge believer in ops, whether it's labeling or having some human-in-the-loop component, you've got to invest world-class ops. Those are the unsung heroes in the world. Everyone wants to be an MLE, but the operators are amazing folks who really make this possible. Jordan: In terms of more like research-y in the ML world, maybe this isn't a hot take, but I'm still a big believer in symbolic reasoning. Maybe I'm just one of those old foggies that is going to die on a hill, but it's just so clear to me that the way our brains work is partially symbolic, right? Not fully. Obviously, you get some stroke of intuition, et cetera, for the way we do item classification. Jordan: It's like, who knows? 
It's literally a deep network of real neurons. But it's so clear that when I'm introspecting the way that my brain works for something slightly higher and more abstract, that it's doing something more symbolic and is really kind of thinking through the sort of graph structure of the problem and breaking it down, exploring different aspects of the tree. Jordan: I think there's got to be some way to merge it together, right? So if I had just made $800 million, I would be using Rust to solve how to bring symbolic logic and mega Transformer models together to rule the world and solve world hunger. Lukas: Wow, I kind of hope there's an exit in your future. Lukas: I guess the last question is, what's the hardest part about making machine learning work? In your case, maybe I would say what's been the most surprisingly difficult part of going from these image models to a working system in production that people can actually use to purchase stuff? Jordan: So many things, but the world is messy. It's super messy, right? And then in this case, I literally mean messy. Because stores are chaotic places, right? There's thousands and thousands of items, and most of them aren't in the place that the retailer wants them to be. Jordan: They have these meticulous plans that they invest in called planograms, where they optimize where all the products should go. And the CPGs are investing too because they're trying to sell you more Snickers. I don't know why I keep using Snickers. They have this plan, but then show up at a C-store, show up at a grocery store, and stuff's everywhere, right? Jordan: There's people unpacking boxes, and misplaced items, and there's just random stuff on the floor. They try really hard to keep the store clean, obviously, but it's just a pretty chaotic place. Retail is chaotic, right? You've got thousands of people coming through the store every day. It's going to get messy. Jordan: That's challenging. It's a really dynamic visual dataset. And just random stuff happens, right? In the AV world, they talk about the long tail distribution of reality, but yeah, we see that. Lukas: All right. Give me some long-tail cases. I love these. Jordan: One of our stores, we had a Listeria outbreak, so they had to throw away all the fresh foods. In retail, it's called selling air. You can't sell air, so they had to put something on the shelves, but they didn't have any fresh foods. So that store manager...and store managers are typically super empowered in retail. There's these massive companies, but store managers actually get to have a lot of say, because they're the ones that are trying to sell stuff, right? And they know the local clientele, et cetera. Jordan: So that local store manager was like, ""Well, I'm just going to go get fresh food. I need sandwiches. I'm going to go get sandwiches."" They went and got new sandwiches same day, brought them back, stocked their shelves. And now suddenly from a computer vision perspective, you're like, ""Well, we've never seen these products before. We don't know what the barcodes are. We have no data set for this."" But the store manager's like, ""I need to turn around and start selling this stuff right now,"" right? Jordan: We were able to turn that around, and start selling it pretty quickly, but that's super hard. And again, it's this really rich intersection of engineering, ML, and operations, and client support too. You have to bring all those things together. This is not just ML. Jordan: That's a lesson that we've learned over and over again. 
Every piece of this has some connection to the shopper, to the retailer, to us as a business, and you have to bring all the stakeholders together. We're a super cross-functional team, and we love coming together and looking at all the different sides of the problem to ultimately make something that we can put out into the real world. Lukas: Awesome. Well, thanks so much for your time, Jordan. That was super fun, super informative. Jordan: Yeah. This was awesome. Lukas: Thank you. Jordan: Awesome. Yeah. Thanks for having me on. Lukas: My pleasure. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out.",10501 +"Drago Anguelov — Robustness, Safety, and Scalability at Waymo",https://www.youtube.com/watch?v=5qpwafctMUw,4141,2022-07-14,"Drago: I think that simulator has this huge scaling promise. You take any scenario you saw, you release the agents and you release yourself, and you can try all kinds of stuff and they can try all kinds of stuff. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Today, I'm talking with an old friend, Drago Anguelov, who is currently a Distinguished Scientist and Head of Research at Waymo. He's been working on image models for at least the past 20 years and was another student of Daphne Koller, who was also on this podcast. This is a super fun conversation. I hope you enjoy it as much as I did. Lukas: My first question is something I hadn't realized, which is that you were one of the authors of the original Inception architecture. Drago: That's right. Lukas: I should know but I somehow missed that. Can you tell me how that came about and what you were thinking about at the time? Drago: Actually, the story goes back even before. At Google, I worked on Street View for a bit which was related to autonomous driving and one of the areas where, actually, computer vision used to work pretty well. There were two, face detection and license plates blurring. That worked pretty well at the time. The other thing that worked really well is 3D reconstruction from cameras, from LiDAR. That's what I used to work on. Bundle adjustment and so on. Lukas: Were those deep learning models at the time? Drago: None of them were deep learning. Actually, we had one person in 2008 or '09. He came from Microsoft. I think his name was Ahmed or Abdul. He used deep nets to essentially detect and blur license plates. Everyone was very unhappy that he used deep nets, because it was his own code base and no one else was doing anything like it. Of course, you could modernize and upgrade it by doing support vector machines. Lukas: Right. Drago: Eventually, people tried to modernize and upgrade with support vector machines the neural net things and they didn't quite succeed. I think they regressed a bit but everyone used technology they understood. I didn't work exactly on that problem but I think at the time, that's how we used to do it. That was in 2009, maybe in '10, right? And after working in this field, I decided that maybe I should do something more adventurous in my career and join a team in Los Angeles that essentially was called Google Goggles. It was not the glasses. It was a little app that did computer vision and we used to use it for experimental computer vision tasks. 
There, we started experimenting with different applications of learning and deep learning to computer vision. How can we recognize these objects in these pictures? At the time, there was a time when... I was a tech lead manager of a small team. There were four of us. Half of us did graphical models, deformable parts models. You may be familiar when I was a student of Daphne Koller, we did a lot of those. Then half of us, the other half, we were experimenting with deep learning. That was Christian Szegedy and Dumitriu Erhan. In those early days, the deep learning models at Google were something that was called QuocNet, which is a non-convolutional, big neural network that Quoc, a step-student of Andrew Ng, brought. For a while, we were trouncing it with deformable parts models and I was working with an intern who later became a Googler. We had the best deformable parts type detector. Even collaborated with Professor Deva Ramanan. He was also in the LA area at the time. We had the collaboration, built something nice. That's where Christian Szegedy came in. He was actually on my team. For a while, the deformable parts were beating the deep nets. But then eventually, AlexNet came in and then all of a sudden, no custom solution could beat the deep nets and so we switched to this. But we were early on this already. We had people that had been doing this for a while. So two interesting things happened. We started optimizing the architectures because, actually, in Google, that was the easiest thing to do. Because the system for training them, it was called DistBelief. It was pretty unwieldy, and so you couldn't be too smart. The easiest thing to do is to just tweak the architecture. We're tweaking it and Christian, one day, comes to me and says, ""Hey, Drago. I have this idea. It's a Hebbian-inspired idea. I'm going to train this new architecture."" I was like, ""Oh, Christian. Very nice."" I mean, we had been playing too. I had some versions of architecture that was 1% or 2% better or something. ""What part did you change?"" It's like, ""I'm going to change everything."" I was like, ""Oh, that's a great engineering approach. Aren't you worried that...who knows what will happen?"" ""No, I have a good intuition. I'm already training some. It's doing great."" A bit later, he's like, ""Look at this thing!"" It beats anything we've ever seen. And that's when we decided to do also the ImageNet challenge. We had this and some detector work as well. SSD came out of it. That's also a very strong contribution by Christian. But he was bold and he decided to just try more ambitious things. And in these early competitions, people still tried to do a lot of smart things in the old style. They tried to embed non-algorithms in the networks instead of making the networks better. And we, for good or bad, were in the environment where the easiest thing we could do is make the networks better. That's what we did and I think that really helped early on. Lukas: What was the intuition that he had to...what was the tweak that really made a difference? Drago: I mean, I think there is...if you remember the Inception architecture, it had this...in each module, there were several paths. One path was doing 1x1 convolution, mostly just adding depth processing. Then it was 3x3 and 5x5 convolutions, and those were adding...expanding the receptive fields, right? Then you had a separate channel for each which kept the model still tractable, so it's not like number of inputs than number of outputs. 
So you had some...it's something like block diagonal, not quite structure because I had three channels and then you, again, condensed the information from those. That was the idea for a block. It's a nice compact way to still add a lot of rich structure in depth. That was, I think, very powerful. I think if you ask Christian, he'll give you a whole other story why he came up with this model. I'm not sure I'm the best person to channel it. You should invite him to...I mean, he did a lot of these early visionary things. But we all worked on it together and that's how I was part of it. The other thing we...actually, again, Christian was very involved. We discovered...this was 2013 or so. Deep nets, we used a lot for classification but not many had used them for detection. Hartmut Neven, who was our director, came and said, ""Hey, Christian. I have this idea."" I think some of it came from Christian. ""Let's make a better detector. We just backpropagate the signal through the network and see which parts of the image caused it to fire that it's a cat. If you do this, you will find where the cat was because the network will highlight you the cat."" We're like, ""Oh, that's a cool way to do object detector. We don't need to...we can mostly use a classifier. Let's try it."" We tried a few versions but Christian tried it and it's like, ""It doesn't work."" And it's like, ""Why doesn't it work?"" ""Well, the image doesn't change. Now, it says 'It's not a cat.' It's whatever, 'giraffe.'"" I mean, you name it. We're like, ""That's strange."" He debugged for a long time. I mean, it's also that at the time, the system was complicated. The written was not easy to debug. Maybe two months, he debugged, including trying on MNIST. On MNIST you could do the same. Then eventually we realized, ""Okay, something is happening here. There's these adversarial examples. You can just flip the label without much visible...any changes in the image."" But we started off to discover a detector and then ultimately, he ended up with the paper. They bunched several discoveries in that paper but by far, the primary one was the adversarial examples. Lukas: That must have been an exciting moment. Did it feel like these image tasks were getting better much faster at that time or did it feel like a gradual change? Drago: I mean, it was a very exciting time, right? When a new set of...a whole new field opens in front of you. Let's try to do computer vision with deep nets and most of it hasn't been done. And most people are not doing it either, right? I mean, there were a lot of developments at that time. Every few months, something pretty major happened. I mean strangely enough, this continues. If you caught a bunch of people in 2015, '16, and say, ""Okay. What's left in computer vision, in 2D computer vision? How much should we do?"" We're like, ""We're pretty good already."" I mean, that's why I went to self-driving. I was like, ""Okay, 2D computer vision on images is pretty good now, in 2015. Let's do cars. That's a whole other game."" But now, early on, there were a lot of big developments. Batch normalization came out, Sergey Ioffe, and again, Christian Szegedy were involved. That's down the line. I mean, in Google Brain, people did a lot of really cool things. It was just like one after the other, there was a group of people. That was also a time when a lot of academics came to Google to do deep learning, right? Later, a lot of them went back to academia, they realized they can still do it there. 
For a while, it's like, ""We just need to do it in the big companies."" There was a bit of this. At least that was my exposure to it. Maybe people have different interpretations. It went back and forth. Now with what people call foundation models and the big transformer language models, people say, ""Maybe we should be in the industry again,"" right? But there was a time when people could go back to academia and not feel too deprived. Lukas: I think there have been four versions of Inception, right? Are people still working on improving these architectures or does it feel like we've squeezed out all the improvements from that? Drago: I mean, it's moved on. There is actually a guy on the Waymo research team called Mingxing Tan, who worked with Quoc Le, of the famous QuocNet I described. That was not convolutional. Hey Quoc, I don't mean anything bad. These folks are doing great work. I think Christian just moved away from trying to improve the architecture. So, there were Inception and then there were... I think, Francois Chollet — who was also briefly on the team I led at Google, that was still '15, and who did Keras — he had, I think, Xception or Nextception, another variant he simplified. Lukas: Yeah, that's in the Keras library for sure. Drago: So, he developed that. I think afterwards, now, people moved to the large transformer models, right? XCiT, and Swin Transformers, and Google. Mingxing Tan, some of his work. There's a model called CodeNet. In our times, with top-1 on ImageNet, we would get maybe 70% accuracy. Of course, top-5, it's a lot better. People used to score at top-5. Now, people can get, I think, 90% top-1, with a lot of pre-training on large datasets and other things. But this CodeNet is a hybrid convnet transformer and it's dramatically bigger than the models we used to train, and it's pre-trained on a lot more stuff, potentially. But people have pushed what's possible in ImageNet a lot further. Now, I'm not sure how much yet further you can push it, given the inherent limitation of the dataset. Lukas: Right. Drago: But people are very good at ImageNet with different technology than what we used to do. Lukas: What's the inter-annotator agreement on ImageNet? How well do humans do ImageNet? Drago: I have no idea. I mean in the old days, Andrej Karpathy, he did a test where he tried to label the test set after training himself. And I think the models were competitive with him. By now, I think they blow humans out of the water. Speaking of Andrej Karpathy, actually, it's a small world but in 2012, when we were doing deformable models and deep nets — actually I was going there with the story and then we went in other directions — I had the chance to pick either Andrej as an intern to do deep learning, or a guy called Xiangxin Zhu from Deva Ramanan's lab to do deformable models. And I picked Xiangxin Zhu, right? So, I never got to work with Andrej. Maybe to my peril, but yes. Lukas: I remember I interviewed with you to be your research assistant as a master's student and you chose Jimmy Pang, who is a very talented guy. Drago: Oh, my. He's a very talented guy. I'm sorry. Don't hold it against me. Lukas: No, he's a good choice. I can't hold that one against you. Drago: Hopefully, it worked out for all of us. Lukas: It worked out for everyone. I'm really curious about...you've been in autonomous vehicles for quite a while. From the outside, it feels like autonomous vehicles are steadily improving. Sorta feels inevitable to me, I guess?
But so hard to tell when I'll really be able to just purchase an autonomous vehicle and ride it. I'm kind of curious, what... I mean, I'm really curious about what your thoughts are where things go. But have there been major breakthroughs in autonomous vehicles in the last 10 years that you've been working on them or has it been really an iterative process? Drago: It's an interesting domain because...I wasn't at Waymo early on, but people that were at Waymo were very proud of the demos they could do even 10 years ago or 12 years ago, right? Waymo is 13 years old, we've worked on the problem a long time. As everyone also understands, it's the very interactive, rare cases that you need to be robust — and all the possible failures that you need to be robust — that makes it so hard. A lot of these improvements are not so easily perceptible. In the early times when you sit on the vehicle, it feels pretty good but you need sometimes dramatic improvements under the hood to make sure that it's really pretty good and comparable to humans. I mean, humans ultimately are pretty good at driving, all things considered. Especially when they pay attention, right? Which of course, that's one big advantage of autonomous vehicles. They always pay attention. I would say that over the last 10 years...and I'm happy to be part of the process. Just in computer vision but here even more, obviously, I think the entire technology is being rethought ground up. I think machine learning takes constantly more prominent roles and the types of machine learning and the models we do continue improving at a fast pace. So, there is a lot of capabilities. And I think you can see that, for a while, maybe there were no notable launches. Even though you would hear about the space. Now, people start launching things, right? I mean, Waymo launched the first fully driverless service in Phoenix in 2020. In public, I think we've driven over half a million miles in autonomous mode. In San Francisco, we started driverless operations. That's another big milestone. I mean, we're building a stack that can handle car, truck, highway, domain cities, but it's one driver still. But these are deployments we're having. We announced we will launch downtown Phoenix. I was on the car actually in San Francisco, in driverless operation, maybe 10 days ago. It's awesome, right? I think when you start seeing milestones like this, they're meaningful. Now, the truly meaningful milestones is when you release it at large scope and scale, right? You want to do it in a thoughtful manner, make sure that you're confident when you put these things out there, that they interact well with people and are safe for everybody. Lukas: When you say that autonomous vehicles have been really redesigned and machine learning takes a more prominent role, could you give me a flavor of what the trends are like? Are things moving to a more end-to-end system where the inputs are like a camera and the output is which way to turn the steering wheel? Or are things becoming more componentized, where each piece is responsible for something? What are the big trends over the last 10 years? Drago: The main trend, I would say...and I've tried as a leader of the research team at Waymo, which does pretty much primarily ML, almost exclusively, right? But we started applying to perception and then prediction and understanding behavior, and then planning, and then in the simulation, right? I think it's permeated every aspect of the system. Onboard, offboard. 
There's machine learning in all these components. Major models, meaning they're not just small features, they're core parts of the system. On a macro level, that's a change that's definitely happened at Waymo. I think when people started early on in 2009, there was a famous Sebastian Thrun book, ""Probabilistic Robotics"". There, you have the LiDAR. You can create all these segments out of the LiDAR, then you can reason about the segments you can build. Initially, people — without the deep learning models — would build a very modular stack, with very many modules. Each does a little something. You put them all together. It's a significant engineering challenge. The trend has been larger and larger neural nets, right? Doing more and more, potentially going from neural nets in narrow scope to neural nets in wider scopes. Maybe narrow nets from one task to neural nets to do multiple tasks. The trend is for the modules — with the help of machine learning models — to grow larger and fewer. Now, there is an interesting question, and this is an area of exciting research. Not only, I think, in the industry. Some companies espouse a fully end-to-end learned approach. There is no clarity if a fully end-to-end learned system is actually better. There is, in life...when you build these things, there is often trade-off between different extremes, right? Each of these things has its pluses and its minuses, and you want somehow to take advantage of the pluses but not to be stung too much by the minuses. We were maybe too much on the end of too modular system, too many small pieces written by engineers. Whether the answer is several large modules or a single end-to-end thing, I think it's an open question. I think this is an area that we are still..as a research team, we're exploring the repercussions of these things. The industry is exploring because people have different vision for some of these things, right? But I don't think, I would not say...there's some serious trade-offs to doing everything end-to-end. Not in academia, by the way. If you take an academic dataset and you train more end-to-end in the small scope, you will probably do better. But that does not mean you build a better system, in the production setting. Lukas: Why is that? What are the pitfalls? Drago: I think ultimately, in an academic setting, you look for a lot more average metrics and things, right? And the dataset is small. Clearly, if you build something that incorporates everything and co-trains[?], it will probably do better, especially if you optimize on it. In the production setting, you're looking to be robust to the very rare cases. You look at speed of iteration and ability for people to fix your model if there are issues, right? You look at the stability. Like, understanding there are issues, being able to dig in. Simulation requirements too. If you have a fully end-to-end model, now you need to...your simulation has to be end-to-end and you need to simulate all the sensors as needed and so on. That's maybe a lot more expensive than some intermediate representation that may be simpler to simulate. Maybe it does not pass all the information the model may want to pass, but at the same time, you'll get other benefits. Maybe you can train closed loop a lot faster. That, now, also can help you. There's very interesting trade-offs in this space. 
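A toy numeric illustration (assumed numbers, not Waymo data) of the average-versus-rare-cases point above: a model that wins on the mean error can still be the worse choice once you look at the tail of the error distribution, which is what a production system has to care about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model A: slightly worse on typical scenes, well behaved in the tail.
errors_a = rng.normal(loc=1.2, scale=0.3, size=10_000)
# Model B: better on average, but fails badly on roughly 1 in 200 scenes.
errors_b = rng.normal(loc=1.0, scale=0.3, size=10_000)
errors_b[rng.random(10_000) < 0.005] += 25.0

for name, errors in [("A", errors_a), ("B", errors_b)]:
    print(f"model {name}: mean={errors.mean():.2f}  p99.9={np.percentile(errors, 99.9):.2f}")
# B "wins" on the mean while its 99.9th percentile is far worse — the kind of
# gap an average metric on a small academic dataset can easily hide.
```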
Lukas: I think one thing that I've really noticed from my vantage point of selling tooling to autonomous vehicles is how many customers there are for labeled data and Weights & Biases stuff. Do you think it's a wasted effort that so many different smart teams are tackling the same problem? Do you feel like there's a diversity of approaches that's interesting? Does Waymo have a specific point of view that is different than Zoox or other places that you know of? Drago: I think if you start looking at the stack, first of all, you don't really know. I don't know exactly what the stacks of the other companies are, right? They're proprietary. I think there is an interesting search space where you are saying, ""I'm going to design this system. It will have these APIs. These are the intermediate outputs. This is how I build my tooling. This is how I understand how the system is doing. This is how it iterates on each of them. That's why this representation is beneficial for onboard perception, onboard performance, for example in simulation."" Right? You take all this into consideration. It's a very wide search space. I think every company ends up with somewhat different APIs, design choices, trade-offs, and how modular versus not trade-offs and how much machine learning they put versus not. This is very understudied because ultimately, it's hard. I think in research, it's a lot easier to study every one problem in isolation. You can say, ""Okay, let's do 3D flow prediction."" We have some state-of-the-art 3D flow prediction or monocular depth others do. I think when you start combining them, there's actually a lot of variability possible and people's stacks end up quite different in the end. Even if on the high level, you can say they're somewhat in a similar way. Lukas: Interesting. Do you feel like there's still deep problems to solve between now and everyone riding in autonomous vehicles? Drago: Scalability is always a problem, right? Safely, cheaply scaling to...I always think what system do I need to build to scale to a dozen cities, cheap. Lukas: But wait, I want to understand that. Because with a normal piece of software, you wouldn't only deploy it in San Francisco and Phoenix. If it's safe in one city, it would be safe in every location. What makes it hard to go to LA and have the same thing deployed or go to Boston and deploy? Why does it have to be once at a time versus...usually, software goes everywhere at once. Drago: I think we're at the point where we're building software that should deploy to most places and be pretty good at it. Maybe historically with the probabilistic robotic stuff, that wasn't quite the case, right? Still, you need to validate that and be sure. I think there is a lot of local particularities in every location that you should make sure you can handle. There's some strange roundabout in this place, and there is some eight-way intersection in this other place, and maybe here in Pittsburgh, they do the Pittsburgh left that you need to understand, right? Someone needs to go out still, collect data from these places, potentially tune the models on these places, potentially then do the safety analysis to convince yourself you actually should deploy. And that is work. That's why you don't just build it once and, ""Okay, let's just drive and see what happens."" I mean, you can do that but if you actually remove the driver, not sure how responsible that it is. Lukas: But I mean, Google has actually mapped every city on the planet, it feels like. 
Shouldn't it be possible to send some cars and collect data? What... Drago: I mean, we are sending cars and collecting data, right? So we are growing our scope. Now, we have Chandler and Phoenix. Now San Francisco, we announced. Downtown Phoenix, we announced that we will deploy starting this year. We're collecting data — there's been public postings — in New York, say, in the winter, and in Los Angeles. And of course, we now have trucks collecting highway data for trucks and behavior around trucks, which is important, I think. There's somewhat different behaviors around trucks and different issues — like seeing around the trailer, for example — that you don't have with a car, and you have a somewhat different sensor configuration. We're broadening, right? I think the car is...and I agree with you. If you look at every city as one deployment and one piece of software and you just develop it and launch it in that city, that doesn't scale, right? The way we think of it, we're building one driver, if possible. This driver is able to handle all these environments, including, as much as possible, cars and trucks. Even though there will be some small differences. But the core of all pieces is similar in the nature of what they want to do. Then you iterate and when you're comfortable that you have enough data about safety and passing the bar, you launch. Lukas: Do you have an opinion on LiDAR versus vision-only approaches? Drago: By the way, maybe one more topic on the previous question. I think if you ask what are the big open questions, I think one of the interesting topics is — and this is a scaling factor for you — you want more machine learning in the planner, if possible. And you want a realistic simulation environment where you can just replay full system scenarios, and without too much human involvement determine whether you're improving or not improving. For us, the big challenge is, it's a very complex endeavor. It's not like someone gave you the perfect simulator for autonomous vehicles. You need to build one. And ideally, you build one from the sensor data and the data you collected. So that's like real-to-sim. And now, by what metrics do you build the simulator? You need to establish metrics for the simulator that constitute acceptable simulation and for our simulation, a lot of it is about behavior of agents. It's not just how something looks, even though we like that work, too. We've done NeRF and the 3D reconstruction and all kinds of things. But ultimately, the behavior is one of the main things you need to solve, so you need realistic behavior in the simulator. Then when you have that, then you have the other metrics, which say, ""What does it mean to drive well in the simulator, in the world?"" You need both. You need to build both things. The further you go, the easier it is to improve these pieces because the less you need humans in the loop, right? We can still improve them. You don't need a perfectly realistic simulator to improve your driving. It's just that it requires more human judgment, right? But there is a process to... I mean in these areas, a lot is possible still, right? And we hopefully will show more interesting work this year. We sent a couple of papers in the space that people may find interesting. Lukas: Cool. That might even be applicable to things outside of autonomous vehicles, right? It seems like sim-to-real type of stuff is necessary for any kind of robotics application. Drago: Yeah, real-to-sim in some sense.
I mean, the specific instantiation is maybe a little different, but I would say that one of the nice properties of AV is that it is a complete robotics problem. Maybe it's a specific kind but a lot of the things you need to solve for other robotics problems are, at least in some shape, covered. There will be hopefully a lot of positive spillover from our domain to others. Lukas: A couple of practical questions that everyone on the team wanted me to ask you, if you could talk about it. Do you have an opinion on LiDAR and more complicated sensors versus vision-only approaches? Do you think LiDAR will always be needed to make things safe? Drago: I think ultimately, it's a question of...I don't know if LiDAR will always be needed but I think it's really great. And I know it's not very expensive, right? Lukas: Right. Drago: I think it even makes your computer vision much better and it makes your simulation much better, which then immediately also results in better driving. It's a fantastic sensor that you can just have for now. So, why not have it, right? I think there is this convergence happening, in some sense. LiDAR is becoming more like cameras. It's higher and higher res. Maybe it even can do passive lighting, so then it is a camera also while being a LiDAR. And it's cheaper and cheaper with the current technologies. On the other hand, obviously, our 3D perception from cameras — even compared to two years ago — is dramatically better, right? I really like having the LiDAR. To me, it's a safety feature. It's a lot safer being in the car with LiDAR than not. Maybe it's theoretically possible to just do it with cameras and maybe it will play out, but do you want to risk it and why? I mean, it's easy to remove LiDAR. No one is stopping you, right? Lukas: Sure. Drago: It's not like we don't have state-of-the-art camera approaches. It's easy to remove. Maps too, right? Lukas: Yeah. That was my next question. I mean, how critical do you think the mapping is? Because that doesn't seem scalable necessarily, right? Drago: It's pretty scalable. I mean, you can do mapping with machine learning in some sense, if you design it properly. Generally, maps are a prior. They tell you about an environment, especially an environment you drove a lot in. What to expect? What is behind this occlusion, right? When you looked at this intersection, what does this thing really tell you to do versus not? Or what to expect around that corner? If you can have some of this information, why not use it, right? I mean, it's safer. Lukas: Right. Drago: Now, should you trust the map as is and require that it is correct? That's not scalable. If you say, ""The map is given to me, I need to maintain it true, otherwise I can't drive,"" you cannot deploy autonomous driving at scale then. You don't have a business. I mean, people do construction, put cones. They change the traffic lights, they repaint things on the highways where the trucks drive, they reroute lanes. You need to deal with this, otherwise you don't have a business ultimately, in the end. You can't trust maps blindly, but why not have a prior? I mean, we drive a city and even to do the safety case or just to collect data to understand what people do, why not have a map prior? Lukas: Do you think it helps enough that there will be one winner in autonomous vehicles that everyone uses, then it gets better because it collects the map data? Is it that much of an advantage? Drago: Which, the map data? I think generally there's scaling benefits in autonomous driving.
I think a lot of the scaling benefits, they accrue when you use large machine learning models, right? You see the extreme case with GPT-3 and the big language models. In our days, when we studied with Daphne Koller, we learned that there is a bias-variance trade-off. You want Occam's razor, you want to penalize models that have high expressivity and you will get the best generalization, right? The simplest model that explains your data is great. Probably better than some fancy overfitting model. Now, all of this is on its head. You say, ""I want to train the huge model that is much, much bigger than anything on tons of data, that may be the same as mine or different. And that model will generalize better for me."" Right? Lukas: Right. Drago: Now, in AVs, what does this mean? We have all this data. Waymo has more data than the vast majority of companies or different platforms. I mean, over 13 years we've driven 20 million miles in autonomous mode, right? We have whatever 20 billion miles in simulation. Simulation is also data. Now, we have cars and trucks. All of these things — if you take the large machine learning point of view — make the models better, because you have more data. It's more diverse data. It captures...we try to see everything that you could see. If you do your job well, these models will actually generalize better. Having cars helps you do well on trucks. And I have all this great car data, right? You add it to the models for the truck and it helps a lot. And car data is a lot cheaper to collect than truck, too. And maybe, a lot more diverse. I mean, often on the highways being a truck, you drive fairly conservatively and fairly few things happen on the highway. But it's a multiplier for you in the multi-platform setting, right? Our domain is friendly to this, I think. Lukas: Right. Could you talk a little bit about why Waymo is investing in trucks? It seems to me like a different enough domain — like more different than a different city — that I could imagine...my first thought would be, ""Well, you'd probably get the cities working first with a car and then switch to a truck,"" but it must not be. Could you talk about that? Drago: I think there's some difference between the two, but ultimately most of the pieces are similar enough that you can share. You can share roughly the same modular design. You can share roughly similar types of models. You can share roughly the same types of simulation environments. You can cross-benefit by cross-pollinating the data between the two domains. For example, to understand how others behave. You can just collect data of how people behave with cars. It will generalize to trucks, right? There's some unique problems with trucks that do have to be solved. One of them is you need to see a lot further for a truck, partly because a fully loaded truck takes a while to stop. Also, if you want to change lanes for a truck, sometimes you need to create gaps, right? And it takes longer to create gaps and do it without cutting people off unnecessarily, than for a car. So you need to anticipate a lot sooner, and you need to see around the trailer, or be smarter about how you infer what's behind your trailer. There are a few of these problems, and that's why you have a bit of a different sensor configuration. But if you look at the core pieces, a lot of the other logic — like which modules you would put together, what to put in each module — is very similar. All the infrastructure is similar. Now, trucking is a very big use case, right? It's a big market.
So it makes sense from that standpoint. There is enough cross-pollination and commonality, more so than differences, I would say. Lukas: Another question I wanted to ask...maybe you get this all the time, but such a common adversarial example is slightly modifying street signs to make a system think it's a different sign. Is that a toy thing that doesn't really come up and doesn't really cross your mind as a major problem, or is that something that you actually really worry about when trying to create autonomous vehicles? Drago: In our case, we have three different sensors, right? I don't think you can fool three different sensors nearly as easily and independently. Furthermore, we have redundancy between the sensors, right? When you want- Lukas: Right. Drago: Part of the beauty of having active sensors is one of them can fail and they can still fairly independently detect things for you, right? From that standpoint, a hybrid stack with multiple different sensors is more robust. That's one. Second, I think generally, these adversarial problems fall in the bucket of robustness. And in some sense, unsupervised domain adaptation. You want to generalize to similar situations. And in research, we have studied these topics. We have methods currently that we've investigated that help with transferring from one domain to another. There is a paper called SPG that we put up. That's an interesting take on essentially adding more structure to a prediction task, detection task. To make it more robust to new conditions. Like you train in sunny weather, then you want to work in rainy weather. It turns out that instead of just regressing 3D boxes, if you first have an intermediate task that regularizes — predict your point cloud and fill it in — then from that, from this canonicalized, regularized point cloud, now you predict your box; it turns out you get a lot of robustness. We did it with unsupervised domain adaptation in mind. By the way, in the Waymo Open Dataset, we released some data for people to study this. We can talk maybe more about this later, about the Open Dataset. But we did it with this in mind and then we realized, ""Oh, this method is actually number one on the KITTI leaderboard."" KITTI is one of the...for hard detection cases. That was maybe a year ago. That's because when you do well — and KITTI is a small dataset — there are rare examples. When you add robustness to domain adaptation and you do it well, it just happens to do well on these examples, on more of these hard examples. So these are techniques that we're exploring. We have, at this point, significant experience with adversarial techniques. There's actually a large space of them. There is a challenge. Many techniques make you more robust to adversarial cases but really hurt your performance in nominal cases. The challenge is to find robustness methods to train your models such that you don't regress the common cases. If anything, you get better and you get more robust to the adversarial attacks. There are such methods. Lukas: That's a good segue into something I want to ask you about, which is the Open Dataset that you've been releasing. Could you maybe, first of all, talk about what they are? But I would love to hear the motivation and what's been surprising in the reaction from the community after putting them out. Drago: I would say that when I joined Waymo in 2018, we started the research team, which is an applied research team internally.
Most of our work actually is primarily focused on improving Waymo systems with machine learning. We do publish too, right? A good amount but not all our work. We're not just made for academia. We wanted to engage the community better. Then the question is ""Well, how do I collaborate with you?"" or ""How do we encourage you to work on certain problems?"" At the time — especially when we started planning the dataset — there was the KITTI dataset. Which by modern measures — it was done I think in 2010, 2012 — it was tiny. Then we thought, ""Okay. The best way to encourage people to solve problems relevant to our setup — which is a lot more data — and the problems we're interested in, let's start releasing data that people can just push the state-of-the-art with. That's what the community does not have."" We released what I believe is still one of the largest and richest datasets, and we are actively making it better and better and better. If you checked it out two years ago, come back and see the kind of things we have now and we will continue releasing interesting data. We have 3D bounding boxes over time, 2D bounding boxes over time. Now, we have 3D semantic segmentation. We have 2D and 3D pose key points for people in busy scenes — the kind of data that very few other datasets in the wild have. We have a bunch of interesting challenges. One of the interesting things is...we released the perception dataset and we picked 2000 run sequences. Which in its time, was quite a lot, right? So, 2000 20-second sequences compared to anything else, it's a humongous amount of data. Then we started trying to do a behavior prediction task with it. If you do this, you realize that for behavior prediction, you need yet another order of magnitude more data. Why? Because say, a scene of 20 seconds, it has 200 objects and you're maybe at 10 Hz in our dataset, right? That's tens of thousands of instances that you observed over these 20 seconds. And maybe you will see one interesting interaction or not in the whole sequence. From this standpoint, then I'm like, ""Okay, what is a reasonable size of data for behavior understanding and understanding interesting interactions?"" And we came up with, ""Okay. If we had 2000 perception sequences, you want 100,000 behavior sequences."" Then of course, then the question is, ""Okay. If you release the sensor data for all of this, how are people even going to download it?"" Then we did some very interesting things. We released vectorized data of the environment produced from our sensors by actually novel systems we have. It's a system called auto labeling, which I think is pretty key for the autonomous driving space. Which in hindsight, after you observe the whole scene, you can try, as perfectly as possible, to recreate everything that happened, right? We have novel work on this. It was published maybe a year ago, or two years ago. With this work, we actually made our dataset. It's still probably state-of-the-art of what you can do with these models. It's very clean data, of a kind that was never done, so you can study aspects you could not before. Lukas: Have people engaged with it in ways that were unexpected? Has it been useful to you? Drago: People come up with very powerful models, which is part of the appeal. You have people from industry, from academia, even kids from high school in some cases. Like one of our challenges...which is really impressive to see just the broad, worldwide reach.
What's interesting is we release it with some problems in mind and we try to help suggest problems. The way we try to suggest problems is...we've been running challenges for three years straight with prizes. So we say, ""Here's a problem. Here's a metric we believe is suitable for this problem. Please submit...here's the leaderboard. Here, you can submit. If you do well, you can win and come to our workshop."" This year, we also have a workshop at CVPR, one of the two premier computer vision conferences. You get to present. People participate and every year, we expand the set of challenges that we have. This year, we have three completely new challenges. Some are really unique ones that have not been run before. Say, future occupancy prediction, both for occluded and non-occluded agents, with flow...there are few such challenges. We have one on, ""From five cameras over time, can you reconstruct the 3D boxes accurately?"" There are variants of this for a single camera, but this is for multiple cameras over time with rolling shutter, which is the real setup on a car. We worked out some very interesting metrics and a setup that has not been done before. That is very core. A lot of people do appreciate, I think, the releases. And I think the more we release, the more different research people can do, because now they can study how all of these enrich each other and how the perception and motion datasets...they have a certain compatibility and you can reason about how to combine some of these, and it gives you a lot of opportunity. But the last point is people started solving problems we hadn't thought of with this data, or doing different research, including ourselves. For example, you can use our data to train NeRF models. I mean, you have all this rich data from all over the place. You could do that. Or you can train 3D reconstruction models, right? You can do shape completion models. I mean, there are a lot of things you can do when you have such rich data, since we release two sensors, camera and LiDAR. If you have camera and LiDAR in interesting environments, you can do a ton. Lukas: Cool. Is it a challenge to convince the business that it's a useful thing to release this stuff? Are there objections, like IP that might leak out, or even possible privacy issues? Drago: There were objections. I think ultimately, people...I'm thankful, and Waymo is a great place to have a research team. I think it's a great collaborative environment with people that really appreciate the value we can bring, especially in an open-ended field. I think you can really balance the concerns, right? I don't think that us releasing the Open Dataset will give such a huge leg up to the competition, because we released some data for people to study. I mean, problems in this space, right? I think ultimately, it's really helpful to everyone, but it's not defining. I think there are a lot more positives for everybody than worries for Waymo, and by releasing it, we hopefully struck a good balance. It has been a lot of work. Ultimately, we want to release data at a quality that befits the Waymo brand. That means we need to handle, say, blurring all the faces and license plates well. We need to make sure that the annotations are very high quality, which they are. We really paid a lot of attention and we ran models to keep mining for potential errors in our 2D and 3D annotations. I think they're very high quality. So hopefully, people can benefit from that. Lukas: We always end with two open-ended questions that I'd love to try before you go. 
What do you think is an understudied part of machine learning, or something that you would want to look into if you had more time? Drago: I would say that I'm perfectly happy doing the problems we have because ultimately, they cover...most machine learning problems are represented in our domain. I would say a few. One of the fascinating areas that we're looking at is...AVs really stress that you want robust systems, right? And we touched on this. So what does that mean, right? This means many things and it depends on which systems. One of them is you want to build inductive bias and structure in the... If you think of the whole thing as one big architecture, you want to build the right structure so it generalizes, right? This means picking the right APIs, the right designs and representations. There is a certain flow in our models, which I think has now become a lot more popular in the whole ML community. You go from perspective view with tens of millions of points, scans, you name it. Then you create a Euclidean space, maybe in top-down view, ultimately...with objects, with relations, with polylines or structure. In that one, models generalize a lot better, so you want to do more of this. That's one. The other one — which we touched on very briefly — but a big part of it is, when you train these systems and make them robust, you need to be able to detect the rare examples. Why do you want to detect them, and when? If you detect the rare examples, you can, of course, bias your training set and metrics to make sure you do well on them, right? When you drive...if you know you don't know, it's already a huge help, because machine learning models, you can think of them as very performant when you trust them. If you don't trust them, you can fall back to something a lot more cautious and safe. You just need to know when. There are a lot of techniques you can study to do this. We can talk about finding rare examples if we get to it, but we have a whole bunch of research on this. We can, maybe after...there is another one that I find fascinating, and that we touched on. This is the domain gap between simulation and the real world. How, and what, should the simulation be, such that you can train the best possible autonomous vehicle stack? How do I build it from the data I collected? What are the realism metrics the simulator should optimize, and then how do you put planning agents in it, right? I think that is a fascinating- Lukas: -can you give me some examples of results in that? I'm not familiar with that work. Drago: There are several things you can do. There are several aspects of realism. You can think of it...when you put your vehicle in the simulator, you want to produce inputs to the vehicle that are similar or highly similar to what you see in the real world, right? Then the outcomes in the simulator are pertinent. What are the inputs to your vehicle? It's sensor data and it's the behavior of other agents in the simulator. These are the two main axes. Some kind of sensor data perception realism. Maybe you do some intermediate representation that's a lot cheaper than simulating every pixel, but you need something. Now, you need agents to behave realistically. Meaning they react to you. I mean, agents need to react to you, right? If you do something different, the simulator needs to cause an effect. It needs to...there's a reaction. It needs to be reasonable, right? How reasonable does it need to be? It varies. As you know, there is strong work on randomization in other domains. 
If you want to train a more robust model, you can even try somewhat unreasonable things. As long as there are enough of them, you can build a more robust model. In our domain, you also want the simulator to ideally be a good measure of risk. And that's a higher requirement. Then you need a higher bar for what realistic means, because it needs to be somehow correlated to the real rates. Lukas: But how would you even know that though, if it's reacting to what the agent does? How do you quantify how good your simulation is? The agent might do something that you never saw in the real world. How could you even know if the simulation is realistic? Drago: There are two measures of agent realism that we think about, and we've presented them in past talks. One of them is a Turing test of sorts. You look at the scene, and it's like, ""Could this agent have done this? Is it likely or completely impossible?"" That's one. That's a proof of existence that it's realistic. Then you have distributional realism. Which is, let's say, how often someone will cut in front of you, or what the braking profile is, or how long someone takes to pay attention to you, right? That is the type of useful distributional realism that you can enforce, and this makes sure that agents behave, at least on a distributional level, similar to what you observe. Now, we've observed a ton of behavior, right? So we have enough data to know roughly what the distribution of these things is. One of the challenges is agents acting in a continuous space. It's somehow practically an infinite distribution. But you can take slices of it that are meaningful and enforce that those are matching, right? There are certain designs there that you need to build in. Lukas: I would imagine there are parts of the distribution that you might care about, but it would be dangerous to even do in the real world. But you might really care what happens if you slam on the brakes or make a hard turn. Drago: You can play any future you want, in theory, if you build it right. I think the simulator has this huge scaling promise. You take any scenario you saw, you release the agents and you release yourself, and you can try all kinds of stuff and they can try all kinds of stuff. And you can learn from that, right? It multiplies your data. If you have good models for the agents, now you have a 100X multiplier on everything. That's fascinating. Maybe if you can score roughly how likely each future is, then you even have a likelihood estimate, right? You can sample adversarially and bias yourself towards the interesting cases. Say someone tries to cut in front of you when you're riding. Most of the time they don't, and maybe 1% of the time they will. Then you see what happens if they try. There are different ways to build it, but you have the opportunity — if you do it right — to really dramatically increase, say, the cases of collisions you can replay. Because we don't see that many collisions. Thank God. Even when we drive a lot. But I can make a lot of them in the simulator, and some will be more realistic than others, and it gives you a nice arena to study these things safely, of course. It's best if that's where you study them, right? Lukas: Right. Do you have more thoughts on finding unusual examples? I mean, active learning has been around for a long time. It's something that I think most companies use when they want to actually deploy something to production. Are you talking about active learning or something more complicated here? 
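As an aside, the distributional-realism idea sketched above can be made concrete with a toy check: pick a meaningful slice of behavior, such as a hard-braking rate, and compare simulated rollouts against logged driving. The metric, threshold, and data below are made up for illustration and are not Waymo's actual realism metrics.

```python
import numpy as np

def hard_brake_rate(decelerations, threshold=3.0):
    # Fraction of frames with deceleration above a threshold (m/s^2).
    decelerations = np.asarray(decelerations)
    return (decelerations > threshold).mean()

rng = np.random.default_rng(0)
logged_decel = rng.gamma(shape=2.0, scale=0.8, size=10_000)     # stand-in real logs
simulated_decel = rng.gamma(shape=2.0, scale=0.9, size=10_000)  # stand-in sim rollouts

gap = abs(hard_brake_rate(logged_decel) - hard_brake_rate(simulated_decel))
print(f'hard-brake rate gap: {gap:.4f}')   # enforce this stays under some tolerance
```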
Drago: I will say a few things. Some of them are papers of ours, observations. Ultimately, our domain is ripe for finding the rare examples. It's one of the main tasks you need to do, right? I mean, most of the time you drive, it should be boring and you need to find...and we collect a ton of data, which is great. The setting is almost: you have some proxy for infinite unlabeled data, and you have some labeling budget. You can label yourself some data; as you know, you ran a labeling company. Now, how do you benefit the most from this data you just collected, right? Most of the rare examples are in there somewhere, if you can find them. That's the first observation, right? That's one. Now, if you were to find them, you can data augment a lot out of them. That's a good way to go, right? We have papers on how to perturb them in different ways. You can do this for cameras. You can do it for LiDAR. You can even machine learn how to best perturb them to get the best results, right? I'll get to ways to find rare examples in a second. There is a long-tail learning literature, and a lot of the long-tail literature was driven in academia by datasets such as ImageNet or... I don't know, is it the birds dataset? There is one. We used to do it- Lukas: Small dataset, yeah. Drago: We used to do it at Google — when I worked for Google Goggles — breeds of dogs, and types of birds, and types of food, and all kinds of things. Typically, the literature in long-tail is driven by these rich semantic datasets where you have some very rare thing. Like a rare breed of bird or a rare breed of plant, and then you need to detect it with five examples, right? But that is a world in which everything was named. Just this name was rare, and you just had five examples. Let's maybe learn to do the most with them. That's one way. Now in our world, it's a little different. In autonomous driving, you don't want to name every type of plant or even every type of dog. You have fairly broad categories. Like, take the category of vehicle to an extreme. There are all kinds of vehicles in the category of vehicle, and 80% of it will be boring sedans. Then down the line, you can have all kinds of strange configurations of things people do, right? A cement mixer with a trailer or something, or trams on the... I mean, you can have anything. Now in this big bucket, what is a rare example? You don't want to name it. And rare is not the same as hard. That's an important property too. I'll give you an intuition. I think people sometimes say, ""Oh, we're going to train an ensemble and where the ensemble disagrees, we'll just label."" That's very standard. What's the problem there? Well, the ensemble finds hard examples. Models disagree, not easy to tell what it is. That doesn't mean it's actually, first of all, rare; second, beneficial to label. You can see this intuition maybe the easiest in LiDAR perception. You do LiDAR perception. If you do ensemble mining — we've studied this. Actually a great guy in Waymo Research called Max studied this — you get cars in parking lots far away. They have five LiDAR points. The models clearly disagree about where the bounding box should be. But that's not a very useful example. It's not like if you mined more examples with five LiDAR points, you'll get much better on them, right? Lukas: Right. Drago: You need some mechanism to tell rare from hard. Lukas: What did you do? What's the intuition? 
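For reference, the ensemble-disagreement mining that Drago critiques above can be written in a few lines. This is an illustrative toy, not Waymo's pipeline; as he points out, it surfaces hard examples (like far-away cars with five LiDAR points) that are not necessarily rare or worth labeling.

```python
import numpy as np

def disagreement_score(ensemble_predictions):
    # ensemble_predictions: (num_models, num_examples) scores; higher std = more disagreement
    return np.std(ensemble_predictions, axis=0)

def mine_for_labeling(ensemble_predictions, k=100):
    scores = disagreement_score(ensemble_predictions)
    return np.argsort(-scores)[:k]       # indices of the most-disputed examples

preds = np.random.default_rng(1).normal(size=(5, 10_000))   # 5 models, 10k unlabeled examples
to_label = mine_for_labeling(preds, k=10)
print(to_label)
```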
Drago: You can build, for example, a model that estimates — given features of the examples — a distribution, and check which are actually rare things versus ones we've seen a lot. We have a paper on this. Still unpublished, so I will not say more. But it hopefully will be soon. Obviously, that's one way to think about it, but I just thought it's an interesting distinction. There is another work we did called GradTail, which is published, and that's a bit of the same idea. It's like, ""Let's define long-tail as uncertainty related to the model or the task you're doing."" It's not so much that some class is long-tail somewhere. It's epistemic uncertainty related to the model you're training. What does that mean, really? Again, this comes to say...actually, Zhao Chen, who is the primary author of that paper, has this reasoning, ""Some rare kind of apple is really important and relevant when you try to tell the types of apples. But if you try to tell apples versus oranges, that's maybe not even the relevant case."" So we have a definition of long-tail, which is if an example — when you train it — has a gradient that is orthogonal or different from the mean gradient for the class. Lukas: Sorry. How do you define a gradient for a label? Drago: I mean, you can backprop the gradient for an example. There's some layer and you can check on average what examples from that class- Lukas: I see. Drago: -pass as a gradient at this layer. If you have different gradients...actually, you can be either orthogonal or negative. You can argue which one is which, but if you have a different enough gradient, it's ultimately a long-tail example. Sometimes you find examples that semantics alone does not give you. For example, these types of examples have maybe a class ""fridge"" but the fridge is open. It has some strange point of view. That's a rare example. The class is common. You can predict depth or regress something. The rare example can be the depths of points far away that are close to something occluded. You can't even name these things, but that doesn't mean that they don't exist; the long tail lives at different concepts than just semantics. We've explored this a bit because it's relevant to our domain. Lukas: Is the intuition that these change your model the most? Drago: The rare examples, they will improve the model the most because you can actually learn them. Lukas: Right. Drago: If you feed them in. As opposed to the hard ones: you can feed those in, but the model will waste capacity trying to solve something that's hard to solve in the first place. But when mining, it matters. Now of course, there is a whole meta point, which is ""What is an example you should mine?"" If you have a full, all-integrated eval of the whole system, you can try to introspect the system. And at that point, the example you should mine is the one that causes you trouble downstream, right? That's the optimal world in which you mine. The problem is that now it's complicated, because it couples all your system and evaluation, right? If your evaluation is perfect, you should do it. If not, then some of these simpler, more modular approaches give you a lot of the benefit with a much simpler setup. Lukas: Cool. All right. I want to make sure I get my last question in, which is basically, what's the hardest part of making this stuff work in production? When you think about — from soup to nuts — making autonomous vehicles work, what's the step that's most unexpectedly challenging? Drago: Which question is this? 
Is this from research to production, what's hard, or what is hard about releasing? Lukas: Research to production, exactly. Like you have something working in research, or something working in the Open Dataset challenge, or a thing that works in a Kaggle competition, and now you need to make the car go. Where does this break down the most? Drago: I mean, I'll tell you. Actually, my first experience with this is...I was at Street View, that was maybe 2007. I learned how to automatically calibrate the camera on the car to the IMU and the GPS system. Every once in a while, they were miscalibrated. The panoramas were all crooked in strange ways, right? You don't want to do it manually because there's so much data, so I came up with a system that did it, and I had great results maybe in two months, say. Then it's like, ""All right, let's ship it."" Then you start...I ran it on a lot more data, and I see all kinds of issues. Someone put a bag over this thing; in other cases the car was stopped for too long, so you can't do structure from motion. You just find a bunch of them. That was maybe three more months. Then you run it and it's like, ""I'm there."" Then you run it again and of course, in a large enough dataset, everything that can go wrong does go wrong. Then you find a whole set of yet more rare cases that you need to worry about. So, three more months on those, right? That taught me, ""Oh, I see. From something that works well enough on the demo case — which is typically a paper — to something actually working, there is still a big chasm."" I think a lot of it comes from additional requirements. For a paper, you have academic metrics. They're usually permissive. They're usually fairly average-type metrics over something, right? The only constraint is, ""Okay, there are these one or two metrics you picked. Let's show the main ones and let's show it works well."" Then you go to the production folks and they say, ""But we want this model to produce three more things, and this rare case that it doesn't work on, it should work on. And furthermore, it needs to run three times faster and you need to build it into this system with these constraints."" And you're like, ""Oh, great."" Now, my work may be tripled or quintupled. And then all the downstream models have to work with it, too. It can break them now; maybe it produces new signals. Now I need to work to fix those, right? That's the usual story of how to get a research model into production. I mean, you need to persist and go through stage two or three. The issue there is, even if one or two people are enough to get the demo result, they may not be enough to push the thing through production. Now, I need to ideally...we're lucky that we have a lot of great collaborations with the production team, so we just do it together. They need several people...you need a lot more people, ultimately, to do this, right? That's why also in research...we are an applied research team. We are not there to try every possible thing and learn something. We're trying to ideally guess the right things to try, show that they work, and then spend sizable effort if needed to build infrastructure, integration, often even frameworks, jointly with the teams, such that many people now can accomplish it successfully. That's why the team is not too small either. When you actually have applied aspiration and shipping aspiration, you usually need larger teams. You need fewer, larger efforts, because that's what's conducive to actually landing the things beyond papers. 
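Returning to the GradTail-style idea Drago described a moment ago, here is a rough sketch of scoring an example by comparing its gradient to a running mean gradient for its class. It is a simplified, hypothetical illustration with a toy classifier, not the published GradTail method; scores near zero or negative would flag long-tail examples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 3)                 # toy classifier; gradients taken at this layer
loss_fn = nn.CrossEntropyLoss()
mean_grad = {c: torch.zeros(model.weight.numel()) for c in range(3)}

def example_gradient(x, y):
    model.zero_grad()
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    loss.backward()
    return model.weight.grad.detach().flatten().clone()

def long_tail_score(x, y, momentum=0.99):
    g = example_gradient(x, y)
    c = int(y)
    score = F.cosine_similarity(g, mean_grad[c], dim=0)   # low or negative = long-tail
    mean_grad[c] = momentum * mean_grad[c] + (1 - momentum) * g
    return score.item()

x, y = torch.randn(16), torch.tensor(1)
print(long_tail_score(x, y))
```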
Lukas: Awesome. Thanks so much, Drago. This was really fun. Drago: Great talking to you. Likewise, thanks for having me. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out.",11155 +James Cham — Investing in the Intersection of Business and Technology,https://www.youtube.com/watch?v=T4LXx8Bs1kY,3971,2022-07-07,"James: There's still an enormous disconnect between what an executive expects to be able to do and what the software developer or what the machine learning person or data scientist actually understands is doable. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. James Cham is a partner at Bloomberg Beta, a fund that invests in machine learning and the future of work. He's invested in many successful companies, including my first company CrowdFlower and my second company Weights & Biases. I've worked with him for a really long time, and he always has really smart things to say about technology trends. I'm super excited to talk to him today. So James, you've invested in AI for a long time. You were the first investor in CrowdFlower, my first company, and you're the first investor in Weights & Biases. I was curious to know your perspective on...how your thinking around investing in AI has changed over the last 15 years. Clearly the market has changed, but I was curious to understand how your thinking has changed. James: You know, when I invested in CrowdFlower, I didn't understand that I was actually investing in AI. I thought that there was a broader collective intelligence problem that you were solving. I was really enamored with both crowdsourcing and flash teams, at the point. And to be honest, I kind of still am. I still sort of...in some ways that I think about AI — or machine learning more specifically — kind of as a misnomer, I think that it's actually a collective intelligence thing that's going on. That's on the broad, theoretical side. The big change on the investment side, I think, is we went from a place where people actively didn't want to invest, or where I actively — there are a couple of folks that you and I both know who I actively encouraged not to use the word ""machine learning"" because I thought it hurt their chances to raise money — to a world in which now we live in where there's an incredible amount of investment. What's interesting about the incredible level of investment right now is that we're still sort of at the cusp of getting actual great business results, right? And so we're sort of at that point right now where I think all the pieces are almost all there. But they're not quite, and everyone feels that...you have that little bit of impatience where everyone kind of wants to get it, and the talent's not quite there, or the executives don't quite understand it. That's an uncomfortable, but also really exciting point to be in. Lukas: Do you think there's some chance that we're set up for disappointment? James: We are always set up for disappointment. You know that as well as I do. Lukas: That's true. James: Lukas and I, I'm lucky enough to...every two weeks we have our little morning chat. 
I feel like we have recurring themes and one of them is this continued question of, ""Where are we in the market?"" And you have to admit that the last few quarters, there's this sense that everything is coming together, right? But at the same time, as you feel like everything's coming together, you're still looking behind you to say, ""Oh goodness. In what way are we overselling? In what way are people misunderstanding things?"" At least to me, it feels like there's still base levels of understanding that are missing. And it still feels to me like there are opportunities to define the market in the right way, rather than the buzzy silly way. Lukas: When do you think investors kind of flipped from feeling like machine learning was a science project to machine learning was a good business to invest in? I mean, you've always done early-stage seed-stage investments. That's probably where the change happened the earliest, but when was that and what was going on that caused that change in mindset? James: You know, there's this little joke around Google and Facebook where, you know, ""What do startups really do? We commercialize things that Google figured out five years ago,"" right? And then we bring it to the rest of the world. There's a little bit of that sense that that's not ridiculous. That you saw the kind of changes that people were able to implement and build inside the big FAANGs and then realize that this should be more broadly available. So you had that on the one side. And on the other side you had these remarkable...well, okay, how do I think about this? I think on the academic side, you had a few things happen. On the one hand you had great results, just super impressive results. But also there's a way in which academics sort of figured out how to play the game, in the sense that the machine learning world was well-defined enough now that people could compete on some basis that they understood. I remember there was this guy who gave this great pitch around how to think about advances in machine learning. He made the point that, actually, maybe it's really about the size of the dataset. Do you remember who that guy was? Lukas: Do you think that's still true? James: That was Lukas by the way. That was Lukas. Just to be clear, just to be clear. Do I think that what is still true? Lukas: Well, I do think the size of the dataset is incredibly important. And I think maybe 5 or 10 years ago, I thought it was really the only important thing, and the advances in algorithms seemed pointless to me at the time. But I think in retrospect — maybe I didn't have such a quite extreme view — at that time it wasn't clear that deep learning worked much better than traditional methods. There hadn't been a lot of improvements in algorithms for a really long time, so almost all the advances felt like it was coming from bigger datasets. But now I look at OpenAI and DeepMind, and it feels like a lot of the advances that are happening there is, on one hand, coming from bigger datasets making more advanced modeling possible, but also advances in compute. James: I've got a nuance on the extreme claim you used to make. Which is, I actually think it's that with the availability of large datasets — but also with the understanding that these large datasets were available — it meant that everyone understood how to play the game. 
It meant that you have a whole wave of academics and companies and corporations and groups and teams saying, ""Oh, we can play with these sets of data in interesting and novel ways."" What that meant is that the thing that was the scarce commodity...or the way that you basically laid that piece out, meant that people were able to work on it. And then that's where you get all these exciting advances. In part because everyone agreed on how to think a little bit about the data. Lukas: You know, I wanted to ask you too...I think one of the things that you did really well was maybe starting a real trend in content marketing among VCs, when you and Shivon put out the machine intelligence infographic where you laid out all the companies. I was curious, what caused you to start it? And then I feel like it became wildly successful and you stopped doing it. Many other people have picked up where you left off, but without the same — in my opinion — quality that you had. Can you tell us the story behind that? James: Sure. When the fund started, I think there was a sense that we were at the tail end. Incorrectly, there was a sense that we were at the tail end of a bunch of investment around big data, and that there were a lot of failed, big data projects sitting around. And so then the question was, ""What are you going to do with all that investment and understanding and collecting data?"" One of the claims, or one of the guesses, was that you'd use that data for machine learning, right? There are a bunch of AI applications. And my old colleague, Shivon Zilis, pushed that insight a lot. I think in part because she felt it just intuitively, but also she was surrounded by a set of folks who were playing around different places with it. I think we were both sitting around thinking, ""Wow, this is just so hard to understand,"" and we couldn't make heads or tails of it. Basically, what happened was...you know, Shivon being just a really great synthesizer, but also someone who's quite dogged, decided to go work with another friend of hers who figured out ways to cluster different types of businesses. She basically clustered a bunch of different types of businesses that included a number of keywords around AI, and then categorized it, and then stuck it on a map. I think that was like a two-month process to actually go through all of that and have all these horrible spreadsheets. Because it was super...there are products now that do this, but it was super manual in some ways. And what was exciting about it was, the moment she put it together — I give her all the credit for actually doing the real work — then suddenly it felt like this world was legible for the first time. Then I think we kind of assumed that there should be people working on this full-time, rather than having this just be a part-time job, and they would do a better job of it. For a few years, basically Shivon would take some time off right around the summer to just do the state of what's going on. I think it was really good. The categories were not always right, but at least it gave something for people to agree or disagree on it. And then made a bunch of connections for folks that I think would...it's still valuable to this day. Why did we stop? I don't know. Like, there are too many companies, right? Part of it is there are too many companies. Part of it is, I do think there is a new class of journalists who now think that way, right? 
Who think that mix of computational plus willingness to do the work, plus not sort of subject to the day-to-day grind of reporting the next story. And they should be coming up with those conceptualizations. But I haven't totally seen...I do think it was a novel contribution at the time. Lukas: One thing that I know you are very interested in — because you talk to me about it all the time — is how organizations function as a collection of humans trying to work together towards a common goal. I feel like you think about that more than most, and you think about machine learning more than most. I was curious how you think — or maybe how you've seen — organizations adapt to machine learning becoming more mainstream within them. And I'm curious if you have predictions on how organizations might continue to evolve as machine learning becomes a bigger and bigger part of them. James: We're not yet at the point right now where machine learning is boring enough that it could be adopted easily. We're still in the part of the market, or part of the phase, where there's plenty of exploration and plenty of definition and ecosystem definition to be had. And you see some of that in like slightly misguided arguments around augmentation versus automation. I think you only have those sort of theoretical questions when people don't have actual solutions they're dealing with day-to-day, right? But I think that there's definitely...that's the first part. The second part is...management theorists have thought for a long time — or talked about — the idea of a learning organization. That organizations will actually get better over time because they learn things. Generally that's just been a metaphor, right? Because of course organizations are not people. They don't have minds, they don't learn anything. Maybe things get codified, and processes or rules. Part of what's exciting about machine learning — in the next, in the pre-AGI version of machine learning — is that we could actually digitize a bunch of decisions that get made on a day-to-day basis. And we can actually literally learn from them, right? Something as boring as, ""Do I go to this meeting or not go to this meeting?"" or something as important as, ""Do I invest in this project or not?"" All those things in the world we live in right now have almost no consequences. No one actually follows up on a consistent basis to make sure or understand whether things work or not. Or they do, and it's incredibly expensive and difficult. Just think about...not you guys, but maybe some other theoretical organization will have to spend all this time just digging down to figure out ""What product..."", like ""What random marketing campaign actually happened or didn't happen?"", or how well it worked. And just the amount of automation people need to put in in order to systematize that. What's exciting about...at least to me, what's exciting about the sort of data-rich ML world we could be living in, is that those decisions we can now find out whether they actually work or not. And then we can actually maybe consistently start making better decisions. Now, there are also a bunch of...you were going to say something, what were you going to say? Lukas: Well, let's take your example of ""Should I go to a meeting or not?"" How do I ever even know in retrospect if I should have gone to a meeting? How could an organization really learn whether or not it makes sense to go to a meeting? James: Okay. 
I think there's...one of the other angles that I'm very interested in is that intersection around machine learning and the social sciences. You'll talk to management folks that are more on the AI side, and there's always this question of ""What's the objective function?"" The interesting thing is that on the social sciences side, they've learned the lesson. Which is, ""I don't know. We'll have some objective function and it'll be good enough to sort of manage, but it'll never be perfect. That actually will have to change over time because the most interesting systems are all dynamic, and they're dynamic because people are interesting."" That once you decide that one metric is the right way to measure whether a meeting is good or not, people will start to learn that and they'll start to game it. They'll be like, ""You know what, whenever Lukas smiles twice...I'm always going to make sure to make it, I'll tell some stupid joke."" And it'll detract from the actual purpose of the business, right? I think that the illusion is that you'll come up with some perfect metric. And I think the actual goal is to continually come up with metrics that will change slightly over time, and you'll understand what works or doesn't work, but that'll be okay, right? In traditional organizational science, there's this great paper called ""On the folly of rewarding A, while hoping for B."" And I think that problem is going to be with us forever. But that's part of the fun of the job, right? That's part of the fun of creating organizations and social systems. Lukas: I totally agree with that, but I feel like...I don't want to harp on this case too much, but I'm curious because I always wonder myself if I should go to a particular meeting or not. How would you even make an imperfect measure of that? What do you even imagine looking at to- James: -so you can certainly imagine it. You can imagine it as, ""Is the meeting useful to you?"" You can also imagine it in terms of, ""Is the meeting useful to increase the collective intelligence of the organization?"" And you can certainly do direct measures, where we can just literally ask you, ""How good was that meeting?"" afterwards. Or we can literally ask the team, ""How good was that meeting?"" afterwards. Or we can literally look at the number of things you write after that meeting, or we can literally look at the number of times that you nodded or didn't nod. Which is just to say all those signals are increasingly cheap to gather. And when they get cheap to gather, that's when we actually get interesting innovation. When it's incredibly expensive — when you need to hire McKinsey to do some study and then hire a bunch of people to build some very expensive bespoke system — then it's not that useful, right? Because then your ability to move and play with the edges of your social system becomes too difficult. And then you're sort of...your chance to actually design it on the fly and continue to understand it... That's, I think, the interesting edge around social systems. Lukas: Interesting. Where do you see machine learning making a meaningful difference in organizations today? James: In all the normal places, right? We're now finally getting good enough to cluster large-scale bits of information in ways that are meaningful, so that we can provide consistent responses. 
I think that that piece of it — which is the big version of machine learning; finding the most critical decisions you need to make, the most digitized pieces, and then finding ways to consistently improve and collect it — I think that that's where most of the energy and opportunity is right now. But that'll change, right? That'll change. I think that the exciting...does that make sense, first of all? Lukas: Yeah, totally. James: Let me take one slight digression, as we're talking about this. Of course, the real answer is that executives could know how to apply machine learning if only they understood a little bit more than what they learned from reading or watching a movie. There's still an enormous disconnect between what an executive expects to be able to do and what the software developer or what the machine learning person or data scientist actually understands is doable. I do have to make the pitch — which I think I've done too many times to you — which is, I do remain convinced that the three- to four-hour class that you used to teach to executives on how to think about machine learning probably is the best...if you were to say, ""What's the best way to improve the way people think about machine learning?"", you should make your boss's boss take a three-hour course and just sit around and play with a very simple machine learning model. Because in that process, they will at least have some intuition about how incredibly powerful, unsexy, brittle, finicky, and incredibly scalable some of these models that you build will actually be. Lukas: Well, it's not the core of our business, but I am passionate about doing it. And really, it's not that we shut down those classes; there wasn't actually much demand for them. Or maybe we didn't pursue it aggressively enough. There was much more demand for the tools that we build. But I'm curious, when you did the class... James: Go ahead. Lukas: Maybe I'm actually just soft-balling a pitch to you, but I'm curious. It seems like you really liked that class and really felt like your team got a lot out of it. But really, what was it that you feel like you took away from those couple hours of building models? James: What you did is...it was to a wide, non-technical audience. Well, a few technical folks. What you did is you gave a little overview and then you had them fire up an IDE, open up some things in Python, access some data of...I forget, what were they? Socks? What were the images? Lukas: Oh yeah. FashionMNIST, for those in the audience. James: That's right. You gave them a very straightforward framework, but you had them play around with slightly different approaches. You gave them the opportunity to see the results, and you gave them the opportunity to play with different parameters. And then you introduced a few curve balls. It was actually a very straightforward exercise, but it was curated, and it was accessible to a wide range of folks. What was interesting about it was that for the first time, rather than thinking about the grand vision of machine learning, you had a wide range of folks thinking about it from a very concrete...sort of the way that a developer would, right? Where you're actually dealing with data, and you're thinking, ""What does this actually do?"" And you're thinking, ""Oh my goodness, it's totally broken. But, by the way, I could also just apply this to 50,000 images instantly."" Which is an amazing feeling for someone. And it's a different feeling than what you get from building software, right? 
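For readers who want to try the kind of exercise being described, here is a rough approximation in Python with Keras. It is not the actual course material, just a minimal FashionMNIST model whose layers and parameters you can poke at to get the finger-feel James mentions.

```python
import tensorflow as tf

# Load the FashionMNIST images mentioned above and scale pixels to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),   # try changing width or depth here
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=2, validation_split=0.1)
print(model.evaluate(x_test, y_test))   # scores all 10,000 test images in seconds
```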
I think that that intuition...I'm kind of convinced that you could teach this to Nancy Pelosi, and she'd learn something, and she'd make better policy decisions as a result of that. I'm kind of convinced that if you...we've done a slight variation of this with a couple other executives and it worked really well. At least to me, it feels like that shift in mindset — and also just like the little bit of finger-feel — meant that folks just had better intuition. I think it made a huge difference. And then they also ask better questions. Lukas: One thing that always surprises me about VCs — because so many come from a quantitative background, and I feel like there's so many investments being made — is the lack of rigor in the decision-making processes, as far as I can see. I'm curious. At Bloomberg Beta, do you use any machine learning or any kind of...is there any kind of feedback loop where something's successful and then you decide to invest more in that? James: Only for top of funnel, only for top of funnel. In our case, we're seed-stage investors, right? So our process for follow on is very different from a bigger fund. But I will remind you though, part of the fun of venture is that the game is constantly shifting. If it was exactly the same game, if the business models were exactly the same, then it'd be kind of like everything else. It'd be no fun to be routinized. Part of the excitement of the job — but also part of the opportunity, and the only reason it exists — is that there are chances for new business models to emerge where the old metrics no longer make sense. Those sorts of windows come around every so often. To be honest, where there's that kind of uncertainty, where there's either a key technical uncertainty or key business model or market uncertainty, that's where the amazing opportunities come from. Lukas: You've been doing venture for quite a while now and have seen a lot of big wins and losses. Is there anything consistent in terms of what you saw at the beginning of a successful company? Or does the venture market sort of adapt to whatever that is, and the opportunity goes away? I'm sure you reflect on this because it's kind of your main job. Are there any kind of common threads that you see in the successful businesses that you've backed? James: Inevitably there are arbitragers that exist, or there are ways to tell signal from noise. But because the market is clever, and you're dealing with founders who are really smart and care a lot about what they're doing, you're going to end up seeing...they'll end up imitating those signals of success, right? There's a little bit of this constantly shifting game where you're looking for a new signal to say that, ""This thing means that these guys are high quality,"" or ""This insight is really important."" Then they'll figure out, ""You know what I should do? I should make sure I game Hacker News, and then I'll get all my buddies to go on Hacker News, and then we'll coordinate."" And that'll no longer be a signal, right? Or, ""You know, what I really should do is I should make sure that all my friends are consistently starring my open source project on GitHub."" Just meaning that once you figure it out then...this goes back to why I think these sort of dynamic models are so much fun, right? That's the whole point of it. You march on and think, ""Okay, what's another signal of success?"" Lukas: I'm curious. 
At this moment, if I showed up and I was pitching an ML company and my customers were maybe the less tech-forward enterprises — I feel like I probably shouldn't name names because some of them are Weights & Biases customers — but if my customer base was, you know, Procter & Gamble and GE, would that be more appealing to you than if my customer base looked like Airbnb and Facebook? How would you compare those two? Is one obviously better? James: I do think it entirely depends on the nature of the product and the nature of the solution. The way that I think about it is that there's like a gradient of admiration. And in different types of markets, different people are higher up. Imagine that map, right? The higher up in terms of admiration... In some places, in some markets, in some set of developer tools, then actually it does matter a lot whether or not the early adopters come from the tech-forward, or from Facebook, or whatever. But in plenty of markets — and increasingly as machine learning gets mainstreamed — the questions will all be around business benefit. And then the question is, ""Who are the companies that other people admire, or look up to, or aspire to become in those specific markets?"" I think that's part of the shifting nature of the game. Lukas: I see. Is the gradient of admiration always clear to you? James: I mean, you could...Okay. The secret fun part of the game is when you figure out what that gradient looks like before everyone else does, and then you play with people who are higher up there. You figure out, ""Yeah, everyone's going to admire the data scientists at Netflix,"" you know, whenever that was true. And then you play with them and then you come up with much better insights. Or you know, when it was true about whatever organization. It's not that complicated to think about, right? You just ask people, ""Who do you like?"" or ""Who do you look to?"" And I think that constantly shifts. Lukas: One of the things that we were talking about, that I thought was intriguing, was you mentioned that businesses focused on ML — even if they're not selling ML, but using ML for applications in different industries — you expect them to have a different business model potentially. My thought is that the business model would match the market that they're selling into, but you felt differently. I'm curious to hear your thesis on that. James: So, I'm a VC. I'm only right occasionally and I believe most things provisionally, right? But I'm pretty sure about this one. I'm pretty sure that we underestimate the effect of technical architectures on emerging business models. If you were to go back to like Sabre — which IBM builds for American Airlines, right? When they have a bunch of mainframes — in some ways that business model, which is ""We'll charge a bunch of money to do custom development for you,"" really comes partly out of the technical architecture of the way that mainframes were centralized in some other place. And the moment that PCs come around — or they start to emerge — there's a way in which we think about maybe the best business model ever, which is to say, the one that Bill Gates creates. You know, where you charge money for the same copy of software over and over again. It's an incredible business model. That partly arises because Bill Gates and Microsoft and a bunch of folks were stubborn and clever and pushed through an idea. But part of it was also because there was a shift in the technical architecture, that you ended up with a bunch of PCs. 
And so then a different business model...because there are different economic characteristics of how that technical architecture is both rolled out and how it's developed and how you get value, then some different business model might make sense. Then you see the same thing for the web, right? When you have a ubiquitous client in 1995, I think everyone realizes that that means something new, and it takes five or six years before people come up with the right way to talk about it. But subscription software really only makes sense, and only works, in a world where you have a ubiquitous client that anyone can access from anywhere. Which is sort of a shocking idea, right? You compare it to like delivering CDs, before. Or before that, someone getting a printout of some code that they were supposed to re-type in. In each one of those cases, it's enabled...and there's some new dominant business model that comes about because the technical architecture shifts. Of course, that only enables it. It's really the people who build the thing, and market it, and sell it, and come up with the new dominant business model. They still have to do that. But it just strikes me that the shift that we're going through right now around machine learning, or data-centric applications, or this change in collective intelligence; however you want to talk about it, the nature of building those applications is different enough, and the technical architecture is different enough that there should be some other business arrangement that ends up becoming the better one for both consumers and for some due-dominant customer. You think about how on the machine learning, model-building side, like you just think about the amount of data you're trying to own and control and understand and manage. And you think about how that changes what's a scarce resource. It just strikes me that there's something there. So to be honest, I'm constantly looking. In my mind, what's my grand dream? My grand dream is to meet that person who's working inside one of the big companies, who's been frustrated because she's understood how the grain of the wood of machine learning lends itself to some new business. And her boss's boss is like, ""That's stupid. We need to maintain our margins,"" or whatever. Solving it, that's the grand dream. That I'll find that person, and be able to invest, and partner with them for a number of years. Lukas: In your imagination, are there ways that that model could look? As opposed to...it's a little bit hard to imagine these new things, but, you know, subscriptions have been around for a while. Do you imagine a move to more of a usage-based pricing, or maybe companies that are willing to pay for your data and combine the data. I'm trying to picture what this could be. James: Let me describe something. I led a little conference chat the other day, a little session about this. Anywhere I go, I try to lead a session on this because I'm kind of obsessed. Certainly, usage-based is quite good and interesting, but I would just contend that in some ways, usage-based sometimes puts me as a vendor at odds with my client. Because I just kinda want you to do more of the thing, right? Sometimes it's not really useful because...I don't want to name names, but we are certainly in a world right now where people are wasting a lot of money — either on compute or storage without clear business value — and then they're going to some day actually figure it out. And then cause a lot of trouble, right? 
I think that that's the pro and con of usage-based. There are certainly some notions around data co-ops, where the realization is that as these models get better, when we share our data, maybe we share upside together. I think there are a bunch of folks who are trying variations of that. The dream, of course, always is to be in perfect alignment with your customer. One way that happens is you have something like a value-added tax or a vig, where you benefit when they benefit. But right now — in the world that we live in — understanding that benefit is so hard, right? Because it requires an enormous amount of infrastructure, and management layers, and AB testing, and blah blah. Just think about all the problems, all the reasons why it's never worked. Maybe someone will figure that out. Maybe all the objections that we've had for the last X years around why this sort of benefit-driven business model doesn't work...maybe it'll work with some twist or turn of how we think about machine learning models. Lukas: You had me convinced many years ago that a competitor to Salesforce would come along, that would aggregate the data, and use it in smart ways. And Salesforce has this inherent disadvantage because they're so careful about keeping everybody's data separate and not building models on top of it. Do you still believe that's coming, or do you think there was some wrong assumption that you were making? Or has it happened quietly and I haven't noticed it? James: No, it hasn't happened yet. I mean, Salesforce is this enduring great business, right? That's going to last for decades and decades. That said, it still does strike me that there's an inherent tension. You think about all the trouble that they spent convincing me — or convincing people like me — to work with them, because we believed that the data was safe in their cloud. And then just the idea that I might share data with other clients is crazy and terrible, at least from that point of view. So there's that inherent tension in the traditional, or the now established, SaaS view of the world. I think it's very hard for the incumbents then to move off of that sort of way of thinking about the world. But harder yet is convincing their clients and their customers who've been trained to think that way, right? There's a funny — maybe not funny — story, where Microsoft got in a lot of trouble at some point for sending information back to their main servers about how PCs were doing. They would crash, or there'd be some bug report, then they'd automatically send it back. That was a huge scandal, because ""How could Microsoft be looking and stealing all my information?"" The hilarious thing...not hilarious to Microsoft, but the hilarious thing about that is, that's right as Google Docs is starting. In the case of Google Docs, Google literally sees every single thing I type. I mean, it's literally stored on their servers. And somehow, because it's a different configuration or different expectations around the business, I'm okay about it. I think something similar will happen with some emerging sets of machine learning-driven businesses. Lukas: It's interesting that you say that. You had a really interesting viral tweet at one point showing how much better Google's transcription was than Apple's. Which I thought was really interesting, and actually made me think about the same point. Apple is so known for being careful with privacy and Google is known for being much more laissez faire with people's data. 
But it's not clear to me that Google has used that perspective to create a huge advantage. At least in terms of market cap. Do you think over time Google's point of view will really serve it, or has something changed? James: I think that in that case, it's a little bit of a slightly different nuanced thing, right? I mean, why was that Pixel 6 voice recorder so much better? It was better in part because they had an on-device model, that was one part. And another part of it is that they just collected data with much more thoughtful ways. What did that mean? That meant you had a very fast, very accurate local experience. The fact that that's true...that's definitely true, but it's also confounded with the fact that Google is a very large organization right now, and they've got lots of things they worry about and lots of ways that they're unwilling to take risk. In my ideal world, someone who built the sort of technology that Google did around voice would have decided that, ""Oh, you know what? Actually, this should be part of some SDK or some API, and we should just make this available for everyone."" And developers should be building a bunch of products. That's the other thing that I think we're on the cusp of, because we're just at this point where there's this massive investment in infrastructure, and research, and tooling around machine learning. And we're right at the point where maybe people will build products that are actually good, right? We're just at the point where the lessons learned around how human-in-the-loop works, the lessons learned around experiences on user interface; all those things, they don't quite take...or value-added to the end user. We're just at the point where there'll be enough variation that some ideas will actually take hold. So I'm sort of excited about that part too. Lukas: Are you starting to see that? Because I feel like maybe I'm too impatient, but I can't believe how much better all aspects of NLP have gotten in the last few years. I feel like transcription is now solid, translation now works. I mean, it basically works. You can communicate for sure with people that you don't speak the same language with by using a translation system. Hugging Face and OpenAI's GPT-3 have incredible demos. And yet I don't feel like it's impacting my life that much, except for asking Alexa to play me music. James: You're exactly right. We're at the point right now where I'm hoping that your listeners are building products because now it's easier to access it. You know, there's this talk about democratization of machine learning. We talk about this often, I feel like. But I think it kind of misses the point. The point is that by making this more broadly available, it also means that the extraordinary person on the edge — who might not have had access to try this before, the person with the crazy idea that will make a huge difference once we actually see it — that they can start working as well. That's part of the exciting thing that I think everyone misses as they talk about the way that this whole world is shifting. But you're exactly right. That we should be deeply dissatisfied with... On the one hand, all the progress that's made voice and parts of NLP, we should be super impressed with it. And we should be deeply dissatisfied because the products, and the product minds, and the UI folks, and the business minds have not yet figured out how to take advantage of those advances in ways that actually make sense, and go with the grain of the technology. 
Lukas: One thing that I would imagine being hard as an early stage investor investing in machine learning is that it's so easy to demo successful cases of machine learning. I feel like no other field is it quite as easy to make a compelling demo. And yet it feels like to make a working product, it's often going from like...bringing the error rate down from 1% to 0.1%, or something like that. Do you have trouble evaluating? James: Here's my secret [?]. I'll give you one of my current secrets. Lukas: Okay. Tell me. James: I just assume it doesn't get better. If the application requires the thing to go from 95 to 98, or 98 to 99...what if it doesn't get better? Will users still get value out of it? If users still get value out of it — because of the way they configure the problem — then it's an interesting product. But if you're sitting there thinking, ""It'll just be another month before we go from 98 to 99.5,"" then I'm like, ""Well, I don't really know if I believe that."" This goes back to one of our earliest conversations around search quality. This is like many, many years ago. What's the beauty of search? The beauty of search is that when it's wrong, I'm okay about it. There are whole sets of products in which you can take advantage of the fact that it's super fast, it's consistent, and when it's wrong, I'm okay about it. You do that over and over again, or you find the products that do that; then those are interesting applications. Lukas: For an investor, you're doing an extraordinary job of not bragging about your portfolio, but give me some glimpse of the future. What's the exciting stuff that you're seeing lately? James: There are two parts that I want to talk...that I sort of want to highlight. On the ML infrastructure piece, I still think that there are analogies or lessons to be learned from traditional software development. I think that you guys have done such a good job of understanding so many pieces out of that, but I still think that...you think about QA, like figuring out how to consistently do QA. I think there are lots of lessons to be learned from normal software development to be applied to computer vision and structured data, and those sorts of release processes. There's a company called Kolena that's in the middle of figuring out parts of that. You look at companies like...we talk to Sean every so often. You look at the demo, like the publicly available stuff about Primer. And just imagine what they're actually doing under the hood. If you go to primer.ai and you look at their ability to synthesize huge amounts of data and lots of articles, and just make sense of the world; and imagine applying that...in their case to a bunch of national security use case. If you look up various things that are happening in the world right now and the word ""primer"", you'll see these demos. They can't show you what they're actually doing, but you get that sense of, ""Oh, this is changing the way that people are actually doing things right now."" That's the sort of thing that I feel like on the application layer, but then also in the development part we're just sitting on right now. Going back to my secret ARB — which is I just sort of assume it's not necessarily going to get that much better — there's this great guy, Michael Kohen, at this company called SparkAI. Their big insight is similar to that line. They're like, ""Look, we want autonomous vehicles and we want them to be perfect. But they're not going to be perfect for a long time. 
So let's just make sure there's a human in the loop."" You can think of them as like...whenever the machine is uncertain about something right in front of them, they'll get a response in a pretty short SLA to make a decision. And thus you can actually roll out these sort of real world applications with the realization that the model doesn't have to be perfect. That we can actually have backup systems. I think that sort of perspective — assuming the sort of non-utopian view of what's possible with machine learning — is super exciting to me. Lukas: I'm curious what you think about — and I guess this is a broad question — ethical implications of machine learning. Many people talk about machine learning and ethics and I feel like there's constantly in the news issues that come up with machine learning. What do you make of it? Do you feel like there's special ethical considerations unique to machine learning — different than technology — or not? How do you think about what kind of world you want and what regulations make sense? James: I think it's a good thing that we live in a world where people are more sensitized. On the one hand. I'm very glad to see lots of people applying their minds towards it, on the one hand. On the other...so, this might slightly get me in trouble. There's a game that I play with friends of mine who are ethicists, who are thinking about the effects of technology. I ask..I think it's appropriate to ask these questions around what are the implications with this or that. If you were around in like 1950 and someone proposed the compiler to you; for the first time, someone said, ""We've got this really, really great way of making software easier to develop, and available at mass scale, and et cetera, et cetera."" Would you have allowed me to build a compiler? Just imagine all the harm that could come from a compiler and imagine, to be honest, all the harm that has actually come from compilers. Everything from hacking to stealing money from people, et cetera. There's a way in which I think there's a reasonable argument that we wouldn't...given some current frameworks, there's an argument for why we should not have had a compiler. Which seems, on the face of it, at least to me, crazy. Right? Absurd. To me, the questions instead should...there should be this sensitivity, and there should be these sets of questions, but in some ways the questions should all be around, ""How do we think about what do we do if we're wrong?"" I think one of the beauties of machine learning is that embedded in machine learning — at the very core of machine learning — is this idea that these are not fixed heuristics or business rules. Actually, these are guesses that we just have to assume will be wrong sometimes, right? In that way, once you think from that framework, or once your executives understand that's how models actually work — that they're wrong, they're never going to be perfect. Otherwise you can have a big ""if then"" statement — once you realize that they can be wrong, then you need to build the systems and the processes to deal with the fact that they could be wrong. You also need to build a whole set of ethics and ways of thinking about...questions more like ""responsibility"" rather than ""possibilities"". And I think that shift in the way you might think about machine learning, I think it will be much more profitable, in the sense of being useful for humanity. What do you think? Lukas: I guess it does feel like machine learning might not be as neutral as compilers in some cases. 
If you imagine it taking inherent biases that we have in our society and then encoding them in a very efficient system, so that they can be deployed at bigger scale and with possibly less oversight. James: Right? That's only if you fall for the idea that we're trying to build an all-knowing God brain that will solve things for us perfectly. To be honest, oftentimes when you'll talk to executives, that's how they'll think about machine learning. They'll think, ""If only we can get this perfect, then we can rely on it forever."" But instead, if we thought about it as a bureaucracy that is right some of the time, but wrong too. If we thought about it as a possibly fallible system, and we built in the support for that...because the nice thing about machine learning is that it's incredibly cheap. In the grand scheme of things, it's incredibly cheap to make these judgements. And also it's centralized. Right? By being centralized, and being cheap and conscientious — meaning it's consistent — then you actually have one place where you can go and you can always say, ""We fix it here, we can fix it everywhere."" That's one part of it. I think the other part that you highlighted — which is it captures inherent biases — that's the other part. Which is, in some ways it's a problem with the way that we anthropomorphize machine learning. One way to think about it is this amazing genius thing. On the other hand, you could just think of it as an incredibly conservative attempt to cluster collective intelligence. If we understood that machine learning was derived from data — and data is by nature historical, and anything historical by nature happened in the past — then I think that changes a little bit your expectations about what the model could do. And then it changes your expectations around what layers you need to put on top of it. Because you can't just rely on the model. You're going to have to have both straightforward business rules to protect yourself, but also you have human processes that will actually think it through. I do have to, at this point, make the plug for one of my favorite papers, which is called ""Street Level Algorithms,"" which talks a little bit about this idea. You'll have to link to it, I don't know if you've...have you read it? Lukas: No, no. James: I think I've tried to make you read it many times. It's totally worth reading. You should get Ali or Michael Bernstein to chat about it at some point. But I think their core insight is that if you did think about machine learning models as bureaucracies, or as processes that could be wrong some of the time, that you change your expectations. But also the ways that you can take advantage of machine learning, which is to say, ""You fix it on one place, you fix it for everyone,"" right? Those sorts of inherent advantages go with the grain of the technologies rather than against it. Lukas: Have you ever gotten a pitch on the company and not invested because it made you uncomfortable? Like from an ethical perspective? James: Oh yeah. Plenty of times. And I think- Lukas: -really, plenty of times? James: There are plenty of times when I will say...on the one hand, I'm utility maximizing, but then I have my own idiosyncratic definition of utility. And my definition of utility doesn't map directly to just dollars, but maps into ideas of who I am and what kind of person I want to be and what kind of world I want to be in. I think that that's true about all VCs, right? 
Everyone pretends that they're...rather, a lot of people pretend that they're pretty straightforward in dollar maximizing, but that's not true. We all have tastes and we all have things that we like, or don't like, and good or bad reasons to say yes or no to things. And I think that reality is always sitting with us. Lukas: Is there a company that you feel like you've massively misjudged? Is there any wildly successful business where you'd go back, and think about the pitch, and feel like you missed something or should update your belief system? James: Constantly. You know, the whole set of low-code no-code companies that I sort of dismissed. I don't know if you remember this conversation. There's some point when we chatted, where I basically said that, ""You know what I really believe in? I believe in domain-specific languages. I think that DSLs are a much more powerful way to express business applications and the possibility for business applications, than all these low-code no-code things."" I was totally wrong. I entirely misjudged the value add of making something easy and the way...in part of my head, I was like, ""Well, a developer's valuable not just because they can write things in good syntax, they're also valuable because they have to think through complicated ideas, abstract them, and come up with good code to actually build something, to get something to work."" What I misjudged was that there's a whole set of low-level glue things that people need every day, that are super easy to do, that sort of fall right under the cusp of really scary programming. So, that I totally misjudged. Lukas: One topic that we've actually never talked about — but I kind of wanted to use this podcast as an excuse to ask you — is, I'm curious what you think about AI and consciousness. Can you picture AI becoming conscious? Is that something that you think you could imagine happening in your children's lifetimes? James: What does that mean? Lukas: Could you imagine that there's an ML system that gets to the point where you would not want to hurt it? Where you cared about its wellbeing. James: There are a couple of different angles that I go on with this. I think that's true right now. I feel bad when I do lots of things to anthropomorphized...I feel kind of bad when I drop my phone. I feel really guilty and I feel kind of bad about it. Lukas: For your phone? James: For my phone, yeah. I think there are lots of ways that I, as a human, sort of assume human-like characteristics to almost everything, right from the weather to my camera, to the screen, to some computer program. I get irritated. Why do I get irritated with Chrome as if it's an actual person? It's just a bundle of numbers, right? I actually think that we're there already. I actually don't think that my willingness to imbue moral worth or value to non-human things is something that's out there someday, but actually something that we do all the time right now. And then, although I am Christian — which we've talked about before — I don't really take a magical point of view on consciousness. I think consciousness is controlling what I pay attention to, and the continuing [?]? I mean, I both value it. I think it's really important. And it's an incredibly important organizing principle, obviously, for me day-to-day. I kind of think that lots of things are conscious already. That they already figure out ways to direct attention, and organize, and also tell stories about themselves.
Lukas: Does your Christianity not inform your thoughts about consciousness at all? James: It totally does. But I think there's a little bit of this angle where I think that the things we learn about the world or science constantly shift, and so I'm actually quite open and willing to sort of adopt and adjust based on how we end up changing our view of the universe. I don't know, does that make sense? Lukas: Yeah, totally. James: Is that like a coherent- Lukas: -but I guess this is the thing that always makes it concrete for me — that I was telling you I had to ask you, and I don't know how you felt about it — but I always am curious if people would go through that Star Trek transporter. If you saw a whole bunch of people go through a thing that disassembled their atoms, and put them back together somewhere else safely, and you're convinced that it would work, would you subject yourself to that? Would that alarm you, or not? James: I have contradictory impulses. I get carsick, and I get woozy standing up...walking over a bridge. So I'm sure there'll be that trepidation. But isn't there also this view? When you think about yourself right now versus yourself 10 years ago, a bunch of the atoms have changed, have been replaced, right? In some ways, we are going through this slow motion transportation. In some ways you're just speeding up that transformation, of the rearrangement of those bits. I probably wouldn't be the first person to do it. But, you know- Lukas: -you'd be like the hundredth? James: Meaning that I would not necessarily have some deep, ethical, mystical reason to be concerned about it. Because I kind of think we're going through it already. Your set of atoms...are you your set of atoms, or are you that pattern that your atoms are in? In some ways, you're the pattern. Lukas: Interesting. I'm not Christian, but that transporter I think makes me more nervous than it makes you. James: But isn't it true though that you...if you thought about you current material composition right now, the literal pieces of it have changed pretty substantially and will continue to change. Right? Lukas: For sure. James: I just gave you my most tech-positive version of it, but sure, you're asking me tomorrow if I would do it. I think, ""Well, a little scary, let's find out."" But don't you also believe that you're your pattern, rather than your actual...like, who you are is the organization of these things inside you, rather than the actual substance of it. Lukas: That's true, but I feel like I'm going to experience the world through somebody's eyes, and I think I am concerned that my future self might not be inhabiting the body of the person that comes out of that machine. But my wife strongly disagrees with my point of view on that. I can see both sides of it. I'm just pretty sure that I just wouldn't do it, no matter how many people went through it and told me that it was safe. James: You say that now, but I will just remind you that our ability to adapt to circumstances and to change expectations is pretty dramatic. There are plenty of things you do now that are super...like, it would be super weird to you from 1999 or whatever. You're really young, too, but you know what I mean. Our expectations around what's normal or not normal shift consistently. Lukas: Like staring at a phone all day long. James: Yeah, seriously. Lukas: All right. Well, final two questions. One question is, what's an aspect of machine learning that you think is underrated or underappreciated or under-invested in? 
James: I do think all of the HCI social system stuff really is under-invested in. And I think that there are lots and lots of opportunities. It's interesting to me that the tools that annotators get right now are still so bad. It's interesting to me that the tools that data scientists use in some ways have not really changed since...remember your friend Cahir who wrote that paper, like 2013? Look at his paper in 2013, it's like the tools in some ways have not changed enough, right? So I think there's lots and lots of opportunities there. And I think there are lots of opportunities in making mainstream or more generalized...to generalize from the lessons we learned from human-in-the-loop. I think calling things human-in-the-loop kind of was a mistake. There should be a better name for it. And if we had a better name for it, then everyone would think of all their jobs as human-in-the-loop. Because I kind of believe that. I kind of believe that in the end, if we're successful, every process will be slightly better understood, and could be consistent, and get consistently better. Because our job as humans were to either figure out edge cases or create broad clustering so that we can be consistent. Lukas: So you care about the interface of humans and machine learning, how they can work together? James: I think that at multiple levels, at the level of the person making the initial decision, at the level of the person learning from that, at the level of the people controlling that, at the level of people benefiting from that; I think all those things...we're still in a world where so much of that is siloed. The way to think about it is siloed. And I think the ways to unlock lots of business value — but also to be honest, just straightforward, good things for humanity — is if people had, at all levels of that game, a bigger view of what it is that they're engaged in. Which is sort of a great game of collective intelligence. Lukas: All right. Practical question — which might actually have the same answer. It's never happened before as I've asked these pairs of questions — but when you look at machine learning trying to get adopted and deployed and useful inside of enterprises, where do you think the bottleneck is? Where do these projects get stuck? James: I think they're so often badly conceived and over-promised. You know, we joked about this in the middle of this; I am still kind of convinced that if we offered your exec ad class to every senior executive in the world, that we would basically all make much better decisions, and we'd end up with much more successful implementations. So I think that that part is definitely true. And I also think that the other thing that's holding us back is we still don't have great methodologies for thinking about how to build these systems. That we are still...in software development world — someone just gave me this history — random coding becomes engineering when NATO decides that it's an important thing in like 1968. And then we codified all this waterfall stuff, right? It goes from waterfall to extreme to agile over the course of the last 40 years. And what's interesting to me is that that methodology, I think, is mostly wrong for building machine learning models. We are still shoehorning these projects as if they're software development projects oftentimes, and thus wasting a bunch of time and money. Lukas: Awesome. Thanks James. James: Okay. Take care. 
Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out. So James, here's what I really want to know. How does your religion inform your thoughts on machine learning? James: This might be both borderline kooky and heretical. We'll just caveat it first that way. Lukas: Fantastic. James: I think that there are a few different angles. I think the first is that, at least in my theology, part of godliness is the act of creation. And I think that there's a way in which, as an investor, I put faith in the act of creation in helping people make something new. So that's one part. And the creation of however you want to talk about machine learning, I think there's this sense in which the models that we're building in some ways have inherent worth and dignity, as basically sub-creations of people, right? That we are creating something new, and — whether you want to call it life or whatever you want to call that thing — that it is something fundamentally new and different and interesting. That piece of it then informs the way I think about both its capabilities and why it's important, but at the same time — this is the part where I think other folks might have trouble with this — I do believe that we're fallen. I believe that we...I actually think that we want to be good. But we're actually bad. And I think that anything we create in some ways has tragic flaws in it, almost no matter what. In that way, I'm actually much more both forgiving about people, but also institutions, but also the models that we make, right? These things that we're making have great beauty and potential, but they're also tragically flawed because we are. Lukas: I love it. Awesome. Oh man, that's definitely going on the podcast. That was great. James: It's kind of plausible, right? It's not crazy. Lukas: I agree with all of it. Yeah, totally. James: I think we oftentimes all think we're good. I mean, we think we're good, but we actually kno...it's not that I'm good. It's that I want to be good. And I'm just always doing stupid things. Of course the things I created are going to be imperfect.
That means that there...it also means there's this constant chance for improvement, which is the core of the understanding of gradients.",10831 +"Boris Dayma — The Story Behind DALL·E mini, the Viral Phenomenon",https://www.youtube.com/watch?v=vxc8FKqQxGM,2157,2022-06-17,"Boris: Yeah, I don't know if you've seen that account, Weird DALL·E. They have crazy images that they put up, like the Demogorgon from Stranger Things holding a basketball. That is insane, that is so cool. You go through it and there are amazing things that are just so fun. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Boris Dayma is a machine learning consultant and a long-time Weights & Biases user. In fact, he started using us in such early days that I knew all of the users. Boris has gone on to build a model called DALL·E mini, a model inspired by OpenAI's famous DALL·E project, and somehow DALL·E mini has just captured the imagination of the world, to the point where I've seen it in New Yorker cartoons and all over my Facebook feed. It's just an amazing piece of work, and I'm really excited to talk to him about it today. So, maybe taking a step back for people who aren't familiar with DALL·E at all, could you talk about what DALL·E is, what DALL·E mini and DALL·E Mega are, and how you've been working on this? Boris: Yeah, so DALL·E came from that paper from OpenAI at the beginning of last year, and it was really amazing. Actually, the first time I saw it was a tweet you posted about it, and I replied to that tweet. I was like, ""This is so cool. I'm going to build that, I want to build that."" Basically, OpenAI had done the first really impressive image generation model, where you would type any prompt and it would do something that looked actually cool. Before, you had Image GPT or whatever, which would do something very tiny, you got a bit of the idea, but DALL·E would do something more complex, like the avocado armchair, which was cool at the time. Nowadays the avocado armchair is just something simple, it's nothing impressive anymore. It's crazy, a few months ago I was still very happy when I got a good avocado. So that's where it came from. Basically, in July of last year, I wanted to build it. I didn't do anything for six months. I read the paper quite a few times, I don't know how many times, a ton of times, and I didn't understand much of it. At some point, Hugging Face and Google organized a hackathon where you had to build something cool in JAX, which is a programming framework from Google, and you would have those cool computers from Google, those TPU VMs. I was like, ""Okay, that's an opportunity to do something cool, I'm going to try to build a replication of DALL·E in terms of the results."" It turns out that it worked pretty well. I researched a lot of the papers, some people joined the team too, and somehow the program had pretty cool results for such a short time frame, and then I continued. Lukas: Can you describe how the program works? I mean, it really feels like magic. How is it actually set up? Boris: I know, it felt like magic for me for so long, and I think even if I read the paper again now, each time I read it I learn new
things. So basically, the way it works is that you have good models right now for NLP which transform text into text. For example, for summarization, or for translation, going from English to French. We're trying to do kind of the same thing, except that instead of going from English to French, we want to go from English to an image. But it's almost the same thing, it's just a translation. The way you do it when you do text to text is that you encode the text. Each piece of text becomes a token, which is just a unique number, and you try to predict that sequence of numbers, and each number corresponds to some text. We try to do the exact same thing, except that each number corresponds to a patch of an image. That's all it is. So first, you need to create an encoder that's going to transform that image into a sequence of numbers, the same way the tokenizer for text would do, and once you have that... Lukas: Sorry, can you slow down a little bit for me? You have an encoder that's taking text and turning it into some kind of encoded vector? Boris: Yeah. Lukas: And the vector goes into a decoder that creates the image? Boris: That's right. Lukas: But what did you say about the patches? Boris: That's exactly right, except that typically the decoder would create a number that corresponds again to text. Now you want that number to correspond to some kind of image. Lukas: Oh, so you could try to do RGB, like the pixel values? Boris: The problem is there would be too many, so it wouldn't be very efficient. That's what Image GPT was doing, and that's why they had, maybe it was 16 by 16 squares or so. It would be very, very small and it's very limited. Instead, what OpenAI did is that each number corresponds to a patch. So for example, one number can correspond to a green patch, one number can correspond to a blue patch with a yellow dot in the center. They can all correspond to something more complex. Lukas: I see. Boris: So you basically train a separate model, that's completely independent, that's trained separately and frozen later, that learns how to create those patches. The goal is, because you're limited in vocabulary, maybe you can create only ten thousand different patches, or sixteen thousand or something like that, I think that's what is used commonly. So what you want to do is create the patches that are going to be used the most often, the ones that are the most relevant. You have a model that is basically trained to find those patches. It looks at a lot of images and tries to encode them into a codebook where, in the end, when you reconstitute the image, it's as close as possible to the original one. Once you build that, once you're able to go from image patch to a number, it's the exact same thing as doing a translation. Lukas: And are the patches in a grid, like a two-by-two grid? Boris: That's right, they're in a grid. I think for me the picture is divided so that each patch is about 16 by 16.
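To make the two-stage idea above concrete, here is a rough Python sketch of what is being described: a frozen image tokenizer maps each patch of the picture to the index of its nearest entry in a learned codebook, so the whole image becomes a short sequence of integers read top-left to bottom-right. This is an illustrative sketch only, not DALL·E mini's actual code; the grid and codebook sizes echo the numbers Boris mentions, and a fixed random projection stands in for the trained VQGAN encoder.

```python
import numpy as np

# Minimal sketch of the 'image tokenizer' idea described above: a frozen model maps
# each patch of the picture to the index of its nearest entry in a learned codebook,
# so the whole image becomes a short sequence of integers. All names, sizes, and the
# random projection are made up for illustration; the real encoder is a trained VQGAN.
GRID = 16               # 16 x 16 grid of patches -> 256 tokens per image
CODEBOOK_SIZE = 16_384  # learned 'vocabulary' of patch embeddings
EMBED_DIM = 64

rng = np.random.default_rng(0)
codebook = rng.normal(size=(CODEBOOK_SIZE, EMBED_DIM))        # trained offline, then frozen
patch_projection = rng.normal(size=(16 * 16 * 3, EMBED_DIM))  # stand-in for the encoder network

def encode_image_to_tokens(image: np.ndarray) -> np.ndarray:
    # Turn a (256, 256, 3) image into a flat sequence of 256 codebook indices,
    # read from top-left to bottom-right.
    h, w, _ = image.shape
    ph, pw = h // GRID, w // GRID
    tokens = []
    for i in range(GRID):
        for j in range(GRID):
            patch = image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw].reshape(-1)
            z = patch @ patch_projection                   # encoder output for this patch
            distances = np.linalg.norm(codebook - z, axis=1)
            tokens.append(int(distances.argmin()))         # nearest codebook entry
    return np.array(tokens)

image = rng.random((256, 256, 3))
print(encode_image_to_tokens(image)[:10])  # the first 10 of 256 image tokens
```

Once an image is a sequence of token ids like this, text-to-image really does look like translation: text tokens go into the encoder, and the decoder predicts image tokens one by one.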
Lukas: Oh wow, it's amazing you don't notice that. Those are big patches. Boris: It's not completely independent patches, because with the convolution there's a bit of overlap in the middle, which makes sure that you don't see those patches. Lukas: I see, so they're kind of blended together? Boris: Exactly. Lukas: Interesting. And so, is it the same as the attention encoder from the ""Attention Is All You Need"" paper? Is it kind of identical to that translation model, is that right? Boris: Oh yeah, it's very similar. DALL·E, what they do is just GPT. You have GPT, it reads the text, then there's maybe a special token. I don't have exactly the details, because the code was not released with the paper. The paper is very detailed, but it's missing a few of the small details. So you have all the text, and then at some point you have the encoding for the image, and it predicts, and at some point there must be a special token that says, ""Hey, now it's an image,"" so it switches to another modality. My model is slightly different, in the sense that it has an encoder. It's like a translation model where the encoder and decoder are separate. The encoder will read the caption, and then the decoder will do the image and will be causal. The idea was that it would maybe be more efficient. Lukas: Now, with a 1D string of text, I kind of understand how you feed each individual one back into the decoder and it looks at the previous words, but how does that generalize to a 2D image? How do you do that? Boris: For the 2D image, you actually just put the patches next to each other, you don't consider the 2D. It's actually an issue too, because when you predict the image, you predict from top left to bottom right. Lukas: Oh, is that right? Just line by line? Boris: Yeah, that's what is done. That's a problem, because if you mess up at some point, it influences the entire rest. That's a limitation that diffusion models don't have, for example. Lukas: And how do you actually train this? Because I would think it would generate images that are just totally... each image is actually quite different in terms of RGB values. So how does the training work? Boris: The way the training works is that you will first encode the image into a sequence of numbers, and then your text will also be a sequence of numbers, typically with your tokenizer. Your input is basically going to be those numbers for the text, and the output is going to be those numbers for the image. The image encoder is frozen at that point, you don't train it. So you just go from a sequence of numbers to another sequence of numbers. Then you have that decoder. It will predict the logits, it will predict what the numbers for that image are. It's a decoder that goes one by one, and the attention looks only at the previous patches, and it will predict them. Then you have the ground truth, because you know what the real image is, so you're going to calculate cross-entropy, you're going to calculate the loss, the same as you would do for another problem, to see how wrong you were. So it's kind of strange, because you have a caption that would say ""a cat in the field,"" and then you have a ground truth image, and
basically the goal is to predict that ground truth image. It's a bit surprising that it works, because there are so many possible cats in a field, right? They could all be right, and you try to predict only one specific one, which is the specific image you're looking at right now. But somehow it works, somehow it learns eventually to associate concepts and minimize the probability. The simple way to think about it is, let's say you want to predict a view of something by night, or a view of the beach during the day. The model will know, ""Okay, it should probably be dark images at the top, it's going to be black because it's the night, or it's going to be very dark blue, so I'm going to just increase the probability for those tokens."" And if you say during the day, it's probably going to be blue, maybe there's some sky up there. And then when it outputs the next tokens, it already knows the previous ones, so you have part of the image and you need to predict the rest. It's easier to predict. Lukas: Oh, so when you're doing the prediction, you're feeding in part of the ground truth image and then trying to complete it? Boris: You're feeding in the entire image, but basically each token sees what is before itself. So for example, the first patch doesn't see anything, but the second patch sees the one that's before. Not its prediction, the one that actually was before. Lukas: Sorry, the second patch token sees the ground truth patch, not the first predicted patch? Boris: You're right, the second patch sees the ground truth of the first patch. And let's say when you predict the bottom half, it already knows the exact ground truth of the top half. It also sees the prompt, so the prompt is also supposed to help. Lukas: But I guess I would think, if you just say ""a cat at night"" or something, and I was going to guess a whole image of that that I want to be close in RGB space, I could imagine it might be better to just predict gray than to draw a specific cat at a specific point. Boris: Yeah, that's actually a huge problem. Remember, I was playing with colorization before, and yeah, predicting gray gives you the best chance if you don't know anything, you just output a gray image. But in that case, the loss is cross-entropy on tokens, so you don't have an advantage in predicting gray, because you have the same loss for being wrong whether the color you predicted was gray or blue while it was actually red. You have the same loss there. Lukas: Oh, so you're picking the probability of the next token, and there's only a set of... I see. So you don't have the problem that it's going to show black-and-white images. It has to predict something, and it may as well be a color. I see. What was it like to build this? What kinds of issues did you run into? I would think this would be one of those things that's incredibly hard to debug when the program's not working. Boris: Oh yeah, it is quite hard, because you don't know why it's not running and what's happening. I think what was good is that with the first version, the DALL·E mini, maybe we were lucky, somehow it worked well pretty fast. And we thought a bit about those details, like watching the loss too: you can have a little mistake, like an offset of one token, or other little problems, and it doesn't work. And we used
the pre-trained model. We used just a summarization model, and we decided, ""Okay, the decoder retrains from scratch, now it needs to predict the images."" What was good is that it actually worked pretty well, fast. So when I later worked more on the larger model, you always have a baseline, you know whether it's getting better or worse. But there were some bugs that I spent two weeks or more trying to understand: what was happening, why it doesn't work. Lukas: Can you tell me about one of them? Boris: One difficult one... let me try to think of an interesting one. An interesting one was, I was trying to use ALiBi, which is a way you encode the position embeddings. No, sorry, I messed up, I was trying to use Sinkformer. Sinkformer is a certain way of adapting the transformer where you normalize your normalization, but it's used for encoder models. For a decoder, you don't realize it, but you have some information that leaks through to the next token. So I would have my loss going down very well, very fast, to close to zero, and I had no clue why. It's just that when it normalizes, it gets some information from the future, and that was really hard to understand, why that type of model didn't work. But the biggest challenges were actually when you make it larger. Training those large models is very, very hard. When I first started with the small model, I thought, ""Okay, it's going to be easy to have a large model, I'm just going to add more layers and train it on bigger computers, or for longer, maybe with a bit more data, and it's just going to run."" But unfortunately it didn't work, and that was a bit sad. First, you have to be able to split the memory well across all the devices. It's not very easy, because one model doesn't necessarily fit on one device, so you need to spread the weights across the devices. JAX has cool features to do that, by the way, which were very helpful. But then you have the model becoming unstable. You have spikes that happen randomly. It starts well, the first hours you have your loss going down and you're so happy, and then suddenly you have big spikes, and you're like, ""Okay, fine, I'm going to restart from before."" And okay, now it went through, but five minutes later, another spike. It was really, really hard. On that level, something that was cool is that as I was training it, I was making my reports on Weights & Biases and sharing them online all the time while I was working on it. What was cool is that the community on Twitter got very engaged with it, and they were so helpful. Actually, I don't know if I would have been able to build it with that success without having shared the whole journey, because there were a few key elements that helped make the model better and that were shared through replies on Twitter, like, ""Oh, maybe you could try that."" Like the optimizer we use, Distributed Shampoo, it came kind of randomly through Twitter, or super conditioning from Katherine Crowson. It was a bit random, things that I discovered along the way just by sharing that training publicly, so it was very beneficial for me. Lukas: That's amazing. So you were sharing Weights & Biases reports on Twitter and getting feedback? Boris: Oh yeah, I was getting feedback, and I think little by little there was interest. I was showing the new predictions, and then I was like,
""Okay, I want a little demo online so people can play with it a bit."" And some people were already engaged, like, ""Oh yeah, I see it's better at that, or it's bad at this,"" and it would give me ideas on what to correct. Having it open helped me a lot, because in the end it almost feels like it was not just my work. I got free advice from everybody too, so it was really good. Lukas: It would be fun to see all the Weights & Biases reports. I wonder if you could make a collection of the history. Boris: I did, yeah. Lukas: Ah, cool. Boris: Yeah, if you open the main report, it basically links to all those reports. I link to all the main ones. If I had to link to all of them, there would probably be 50 reports or more, and there are some I did just for myself that I wouldn't share, because they were missing things. Sometimes you have a conclusion, you do some tests, and it's like, ""Okay, it's important maybe to use dropout, or not to use it at all, weight decay has no effect."" But you forget it. When you do so many experiments, you forget why you don't use something. So I would always go back to my reports. I know I had those experiments somewhere, and I would add a line like, ""Oh, here, these are the runs, those are the ones that show that you shouldn't use that."" It was actually very convenient for me to see why I made some decisions, or why the previous code ran but not now, what was the difference between those two runs. It was a major help. Lukas: That's really cool. Do you feel like there was enough in the DALL·E paper to really reproduce it, or did you feel like there were things you had to learn along the way to really get the thing to work? Boris: The DALL·E paper actually provides the main ideas pretty clearly. There's the last model we didn't talk about, but there's one model that's going to encode the image. The one I use is not the same as DALL·E's, it's actually one from Taming Transformers. Some other people adapted it and added a GAN loss and some perceptual loss in there to make it a bit better. The problem is that it creates those weird artifacts that we sometimes have on the faces, whereas the original one would do something blurry. And then there's that model that needs to predict the next tokens. For OpenAI it's kind of a GPT model; mine is more similar to BART, which is encoder-decoder, because I thought it could maybe be more efficient. And then they have that CLIP model, which actually was released, little by little, in larger and larger versions. That's the model that has revolutionized a lot of the research in multiple modalities, and that includes text and images, and now audio, and people are adapting it for 3D. It's the model that basically tells you how well a text and an image match. It will give a score: you give it a text, you give it an image, and it will give a score based on how well it believes they match. So for example, when you use the demo, we output more images than the nine that you see. We output maybe 16, maybe more if there's not too much traffic, which never happens nowadays. Then over those 16, we have CLIP look at them, and it chooses the nine that it thinks are the best, and it actually improves the quality quite a bit. Lukas: Oh, interesting. So there are outliers that you don't get to see? Boris: Yeah, the ones that are really bad, typically it will find them, like, ""No, don't show them."" It's
not perfect, but... So the paper actually has those ideas, which are the essential ideas. Then it's missing some details on how it's trained, and there are a lot of details, some things that are missing, but overall it's a good enough base to build something. I wish I could have just run the code immediately and trained it. Actually, in a way, it's hard to say, but in a way I'm quite happy. The fact that it was not released pushed me, motivated me, to learn how to build it. I don't think I would have learned how it was built if I didn't have to try to build it myself. Lukas: That's cool. How sensitive is the performance to the details of how the model works, in your opinion? I always wonder this, I don't know if you have a thought here, but when I look at the attention mechanism, we tell ourselves a story about the three vectors that get generated and how they're multiplied together, but I kind of wonder how much the specifics of that really matter. Boris: I think there are some details where, whatever you put, it works. It's like when I started to do machine learning: you have to do a convnet to detect cats and dogs, and I remember I was like, ""Oh, I'm going to try to build my own model,"" and then, ""What depth should I put, how many layers? I'm going to put 12 here, and then I'm going to put 36, and then I'm going to put less, and more,"" and I would put random things for no reason, and whatever you do, it works. A simple model, whatever you do, in the end it kind of works. I think there are so many configurations where it would work, and you actually don't need to bother too much. On the larger model, there are some scaling laws and a bit of research, and it's kind of hard to know what works and what doesn't, but I tried to follow them a bit, because I thought, ""Okay, some people tried a few things, I'm going to try to use the same ratio of width versus depth that they have."" And then I have a report where I tried a lot of variants. For transformers, there are like 100 different variants, so I tried a lot of them, and some converged, not necessarily better, but were more stable for some reason. So I tried a bunch of them and I picked the one that worked the best. Lukas: Interesting. Boris: Yeah. For the activation function, I don't think it matters a lot. I tried different ones. Initially, when you tune hyperparameters, it's like, ""Oh, let me try different activation functions."" It barely matters. Overall, whatever you pick is okay. But maybe there are some advantages. I had one that maybe had some noise, and I was like, ""Okay, I'm going to take the one that was stable."" But maybe if I had just taken another seed, I would have had different results, so I don't know how much I can rely on some of those conclusions. Lukas: Cool. What do you attribute the model's massive increase in popularity to? I think we noticed that our metrics for reports are getting messed up because there's so much traffic to your report. What do you think's going on? Boris: I think somehow people think that the model is new, but that model has been there for a year already, almost. It's just that over time I worked on it, and little by little it became better. And actually the traffic, the people using the model, people
involved in the forums and talking about it, actually increased a bit over time. But I think when I trained a larger model, it reached maybe a critical stage where suddenly it became good enough, good enough for virality. I think some YouTubers tried it, maybe for fun. They would put their name in different situations, like in a golf cart or whatever, and suddenly they would see their face, they would see something that looked like them. Not really, it was not very good, but something that kind of looked like them, and they would put themselves in the craziest situations. I think they got excited about it, and little by little it amplified. But yeah, it just reached a threshold where it was good enough. What's fun is that the model is actually still training a little bit, so I'm curious whether it's going to be much better, and there's still stuff to improve. So it's interesting: it already reached a threshold that interests people, and there are easy ways to make it still better. Lukas: So how much data did you train this model on? Boris: It's probably, I would say, around maybe 400 million. Lukas: Wow. Boris: 400, 500 million. But the data is actually very important, and there are some tricks here and there to try to make it work. When we trained the model at first, and when you look at a lot of the open source models that exist that you can play with, a problem that happens a lot is that you would say, ""Okay, I want a view of the snowy mountains,"" and it would draw the snowy mountains, maybe well, and then on top of it, it would write ""Shutterstock."" The model had learned that an image typically needs to have a Shutterstock watermark, which is kind of horrible, because that image was completely new and you would have that horrible watermark on it. So one of the first things I did was, ""Okay, I don't want any of those images. How do I avoid that?"" It actually was a little problem. How did I solve it? Initially, I looked online at how to detect whether an image has a watermark or not. There were some things here and there, but it was not that great. Then some people were trying to generate datasets with fake watermarks to try to detect them, and that was already a big challenge. And then I realized, ""Okay, I can just remove all the images that have 'shutterstock' in the URL,"" and that problem is solved. That's the solution I took, and it works quite well, because you never see a watermark. Lukas: Interesting. And I guess, how long does it take to train on 400 million images? Boris: I think what matters the most... when we didn't have a lot of data initially, after a while the model overfits, even more so because it's a bit smaller. The smaller model was, what, 400 million parameters, which is already quite big, but after a while you overfit. I think after five or six epochs I would overfit, and typically that was the equivalent of maybe, I don't know, two weeks on one single TPU VM. Now, when you have so much data, maybe more than what your model size can handle, I don't know if you overfit that easily, maybe you can. But that's where you need to be very careful to have a good validation loss, because there's that cool model called ruDALL-E, the Russian DALL·E, which is really, really nice, but when you use it, it looks like it overfits.
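As a side note on the Shutterstock fix Boris describes above, the filter is pure metadata rather than a computer-vision model. A minimal sketch of the idea, with made-up field names and an illustrative blocklist (the transcript only mentions Shutterstock), might look like this:

```python
# Toy illustration of the data-cleaning trick described above: rather than training a
# watermark detector, simply drop every image whose URL points at a stock-photo site.
# Field names and the extra domains are made up; the transcript only mentions Shutterstock.
WATERMARKED_DOMAINS = ('shutterstock', 'alamy', 'dreamstime')

def keep_example(example: dict) -> bool:
    # Keep the pair only if the image URL does not look like a watermarked stock photo.
    url = example.get('url', '').lower()
    return not any(domain in url for domain in WATERMARKED_DOMAINS)

dataset = [
    {'url': 'https://image.shutterstock.com/123/snowy-mountains.jpg', 'caption': 'snowy mountains'},
    {'url': 'https://example.com/photos/cat-in-a-field.jpg', 'caption': 'a cat in a field'},
]
cleaned = [ex for ex in dataset if keep_example(ex)]
print(len(cleaned))  # 1 -- the Shutterstock image is filtered out
```

The appeal of this kind of filter is that it throws away a little usable data in exchange for never having to build, label, or run a watermark detector at all.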
It's a bit... I think it's not as good at composition, but it will make really nice images. It also encodes the images in a higher resolution than mine, so they have less of those artifacts, but sometimes you realize that you will type a description and it will show an image that it has seen before. There's a lot of memorization in there, so they overfit, and I think maybe they didn't have the right validation set. There's a problem with that validation set too, which is that if you just take a random subset of your training data and call it the validation set, it doesn't work. The reason it doesn't work is that, typically, let's say the Google logo is present on a ton of websites, so you have a ton of URLs that will be unique and will have that Google logo, but maybe the caption is different. So maybe you try to be unique by image and caption, because it's okay to have the same image with multiple descriptions, but it's not good for your validation set, because remember, when you train it, the pixels see the previous pixels, so it will recognize the image. It may ignore the caption, and you will see a validation loss going down just because it doesn't care about the prompt, it just recognizes the images and predicts them. Lukas: I see, that makes sense. So you're really just training on captions and images that you crawled on the web? Boris: That's right. Lukas: I would imagine that would introduce all kinds of crazy artifacts. Is it easy to generate cool logos of companies? Boris: Yeah, it can. Actually, it's good at creating logos, and it's funny, because it's something that made me happy a while back. Some person reached out to me like, ""Hey, my mom started a new business, she couldn't afford a graphic designer, I just used DALL·E mini and gave her a logo. It was good enough."" And I was so happy that it was helpful in that way. So it can do surprising things, things that I didn't realize would be possible. And the fact that it's open and that so many people can use it, that's why I'm learning so much more. I realized I was barely testing the model before. I was putting a cat on a skateboard, while people have those crazy prompts. When I was putting the Eiffel Tower on the moon, I thought I was creative, but it's ridiculous in comparison to what other people do. Lukas: That's awesome. Do you have plans for what you want to do next? Boris: Yeah, I have a lot of ideas. I don't know where I would go next. Obviously, the models that use diffusion are very attractive, because they make very impressive images, so that's definitely something I want to look at. Lukas: How does that work? I'm not familiar with models that do diffusion. Boris: Diffusion, the way it works, instead of predicting the image in one shot, because here you predict the patches and in a way you just predict in one shot, you iterate many times. You have an image that's initially just noise, random noise, so imagine random pixels, random colors, and you try to remove that noise little by little. You go through it like 100 times, or maybe a thousand times, a high number of steps. There are ways to try to go through it fewer times, but you go through the same model many times, each time it removes a little bit of noise, and at the end it turns into an image that's actually cool. To me it's almost as if it was a recurrent model, where you go through it, but the fact
that you just remove the noise little by little guides it to a loss that's very friendly to train. So it's super promising, and it's already been proven to work very well with DALL·E 2 and Imagen, so that's something cool. The problem with those models is that they're a bit more computationally expensive, so we still need a bit of research on how to make them more efficient, which I think is something interesting. So I'll probably look at that, but there are also many ways my current model can be improved, in cheap ways and fast ways, so maybe I'll try to do that. Or a bit of fine-tuning, fine-tuning it on your own art or your own data, I think that could be pretty cool. And then you can use the same type of model to maybe generate sound, or music, or video. You can do so much with the same type of model, so it's pretty exciting where it's going. I think it's going to go very fast too. Lukas: Is there any feature we could add to Weights & Biases to help you with your work? Boris: To be fair, I feel like I've been using all the features with that project. I have the pipeline, the model is trained, I have the checkpoints, I resume from those, it's all tracked. When I do inference... I can't do inference during training, because it would be too expensive, I would need to load that image decoder model, so it would be inefficient. So I have another machine that does it, that's linked to the checkpoints, so I have an entire pipeline set up that regularly does some inference. Lukas: What could be added? Are you using alerts, for when your model starts to go badly? Boris: I should, because I look at the training way too often, on my phone. You have a little pause, or I go somewhere, I'm walking, and I'm checking quickly: is it still training, or did the TPU crash, is the loss going high? Maybe alerts would make me feel more relaxed. Lukas: Well, I have to tell you, it's been really fun to watch my friends who aren't in machine learning talking about DALL·E mini and being like, ""I know Boris, I know the guy that made it,"" and they're impressed. So congratulations on such a successful model, it's really captured everyone's imagination. Boris: Thank you, that's fun, I like that. I think it's cool to see so many people using it. I was a bit scared, because you could see it in negative ways too, it's creating images, but overall the reaction has been pretty positive. People are happy that they can see, through that model, what the limitations and biases are and what it can be used for, and they can test it out themselves. If I had to figure out the limitations and biases myself, it would be impossible. So I think it's actually really cool. And I like that it's used by people who cannot draw at all, like me. It's kind of cool, because even if the image is not that pretty, it's still so much better than what I could do. And for people who are actually talented, I'm happy to see that some of them use it as inspiration. They say, ""This was the output from DALL·E mini,"" and then they use Photoshop and do something crazy out of it, and I think it's really nice to see that it can also be used that way. Lukas: Awesome. Well, thanks so much, Boris, it was a great chat. Boris: Thanks, awesome chatting with you. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the
+Tristan Handy — The Work Behind the Data Work,https://www.youtube.com/watch?v=A7ktaG8qGFs,3648,2022-06-09,"Tristan: The thing that dbt does is try to get to a ground truth that everybody inside of an organization can agree on, so we can at least have productive disagreement. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Today, I'm talking with Tristan Handy, who is the founder and CEO of dbt Labs. dbt, for those of you who don't know, has gone from an open source project to one of the most critical components of the modern data stack in under four or five years. It's been incredible to watch from the outside and I was excited to talk to him about it. Lukas: You probably are the first person that isn't kind of actively working in the ML field, but data is so critical and tangential that I thought you'd bring a really interesting perspective; you might need to make it a little more basic, you know, for our audience. So I thought I would start out by asking you to describe what dbt is. Because in your world, it's a really famous, well-known product, but I think for a lot of ML people, they might not even know what it is. Tristan: Yeah. Gosh, I sometimes get challenged to answer this question from a ""Imagine my aunt is on the other end of the conversation,"" and it's really challenging. Lukas: I know the feeling. Tristan: Right? ""I do data stuff"". So, if you're in kind of more traditional BI and analytics, your world has changed very significantly over the past 10, 12 years. Really driven by the rise of the modern cloud data warehouse. Now everybody has access to high-performance, scalable, SQL-based compute. You can just throw data in there and, by and large, queries are just fast. And there's this whole ecosystem of stuff that has risen to get data in, to organize the data that's in the warehouse, to report on it, et cetera. We had to kind of...the whole industry had to kind of rebuild all of its tooling around the cloud data warehouse, because the way that stuff worked before was all constrained around speed and size of data. The problem that dbt addresses is that now there's this massive profusion of different types of data that show up in the warehouse. You get Fivetran or other products that... their whole job is like, ""You push a button and now a whole new data source shows up"". But it shows up in exactly the format that it lived in, in the source. So you connect Facebook Ads and you get 150 tables that map one-to-one to like a Facebook Ads endpoint. And so then you — as a data person — you need to figure out like, ""What the heck is even in there? How do I organize this in a way that is useful to do some reporting for my end data consumers?"" dbt creates a new workflow around how to do that work. It's very code-first. It takes DevOps principles as kind of its founding ideas. And it is open-source, open-core; the first commit happened about six years ago. Over the past six years, there's been a pretty large community that's grown up around it. Lukas: And I guess what does it actually do to address that problem? Tristan: So, how was this problem solved before? Before, you couldn't rely on the warehouse to do all this work because the warehouse was constrained. So you had these intermediate environments.
A lot of times, you know, you had commercial products that sprung up to do traditional data transformation that happened before you loaded it into the eventual warehouse. The big insight behind dbt was that the warehouse now is performant, scalable enough to just do it all itself. And what that means is if you want to get access to the compute that lives in the warehouse, you had to — at least traditionally - you had to express your jobs in SQL. And so dbt is essentially a framework for programming...to build data pipelines in SQL. You write in the SQL that is native to your database, and then it has a Jinja layer — a templating layer — over top of that SQL, that instead of just having this collection of random SQL scripts on your hard drive, you have a framework that you can plug into. You have references, so you can build DAGs out of SQL files seamlessly. You have environment variables, you have CI/CD, all of these things that you would expect from a programming framework. Lukas: It's funny. I feel like from my perspective, running a tech startup where I'm trying to get official records of data on all these different topics, it seems incredibly obvious to me that something like this is needed. But I wonder if my younger self- Tristan: -do you want to do the big reveal of how you first got exposed to dbt? Lukas: Oh, should we go down that path? Sure, let's do it. You were one of the very best consultants we've ever hired. You came in and did our analytics, and it was funny because I actually edited your SQL queries that you wrote, quite a bit. And I should say I learned a lot of SQL from you. I felt like you were the first time...working with you was kind of the first time I saw...I mean, I kind of learned SQL as a side thing in school. And then I used it a lot, you know, as...I think when you're CEO, SQL is the language that you end up writing the most stuff in. And so I think I kind of went down a bad path. SQL kind of lets you start to write it like you might write blocks in Excel, where you just start stuffing more and more chaos into your queries. One thing that's actually notable about working with you is you really pulled out each piece into its own named section, which I didn't even realize some of those things you could do in SQL. Tristan: I mean, generally either you can't at all or SQL doesn't make it easy. I think you experienced part of the magic that data people experience when they use dbt for the first time. They're like, ""I've been...for however long I've been using SQL, it's looked like garbage and you've given me some more structure in the language and I can now engineer it in ways that actually make sense."" Which is...for many of us, we're used to thinking in OO, or functional, or whatever, like programming paradigms, and then SQL becomes very frustrating because you can't actually organize your code in these similar ways. Lukas: Yeah. I think from my perspective, if I could write a love letter to dbt as someone who doesn't actually use it, but sort of sees the results of it on my organization...you might not realize how much complexity enters your data pre-processing. We have a lot of people that come in and use our product as students, and there's sort of different ways to get at that. But we often want the students sort of outside our analysis of leads for sales, from that perspective. 
And there's a lot of different ways you could kind of cut who's a student, but it's really helpful to have one official way that's really good, and just kind of nail that down, and then let everybody operate off of it. It feels like one of the big benefits of dbt for us at Weights & Biases is that we're able to kind of standardize all these intermediate steps and have an organized way of...yeah, just standardizing on these things, which I think has made us operate much better as a company. Am I off on... Tristan: No, you're totally right. The way that we talk about this is curating knowledge inside of an organization. It used to be that, like, in our wetware, we used English to pass knowledge on to each other. And then somebody would write a SQL query for themselves based on their kind of imperfect understanding of who was a student. And now there's a way to actually take that knowledge and encode it. Then you can just forget about it as an organization until you say, ""Hey, how do we do that?"" And then you just look back at the code and you can even look at the git blame and you can say like, ""Well, here's how we arrived there."" Lukas: Right. And if you don't do that, you end up with all these different — slightly different — versions of ""what's a student"", and it doesn't match, and it's totally bug-ridden. I feel like dbt has made a big difference. Tristan: Here's the funny thing about that project from my end, from my experience. I was working with you folks and you know about machine learning. And it's so cool, and trendy, and you can do magic stuff with it. And here I am. I'm close to the business, I come at data from a...I understand the business, so let me get into the ""asking questions about it"" perspective. To me, what I do feels not that complicated. I mean, at least not that technically complicated. And so it feels like people who know about something as - you know, this is my internal monologue - something as complicated as machine learning, how can they not have this kind of basic stuff figured out? But I think that...for whatever reason, it's not a thing that ML folks have widely dem...I have my theories on why this might be true, but I'm curious if you have any thoughts on why that might be true. Lukas: Well, I think one thing that shouldn't be underestimated is that most people in ML have a lot of academic training. Like, just a lot of ML comes from academia. I think more than almost any other field. Knowledge gets passed down in academia quite a bit. It's starting to change, but I think it's still, you know, people are going to original papers and kind of learning through professors, and that. And I do think that academia teaches you incredibly bad habits, right? I think everyone kind of coming out of there has to unlearn a lot of things, including myself. Because if you think about academia, you're trying to get to an interesting result and then you never have to make iterative progress from there. Whereas in work, most of what you do is iterate. And so, you really want things to be kind of stable and contained and clear, whereas in academia, most of what you do gets almost immediately thrown away. You're sort of racing and kind of not...you write a lot of throwaway code. You don't think a lot about structuring your code and then you especially don't think about making your data pipelines stable and consistent. Because I think a lot of ML- Tristan: -which is how Jupyter notebooks end up becoming data pipelines. Lukas: Exactly, exactly. 
I totally understand how it happens. And as a CEO of a growing startup, it drives me nuts, right? But I actually kind of come from that lineage too. So I think I've had to unlearn a lot of these instincts as well. I also think it's actually like a real skill - that I'm still working on - to make good data pipelines for a company. Every query is more complicated than you think it is at first blush. And I think a lot of these choices, it's harder to do extremely agile iterative development. A lot of these choices that you make have long-lasting repercussions and need to be considered, and it's more important to get it right the first time for some of these things. Tristan: We started - when I say we, I don't mean the company, but the dbt community - started using this term ""analytics engineer"" for the people that use dbt and...or like do their work in the way that dbt teaches you to do your work. And I think it really gets to this dichotomy where...there are data analysts who use the tools of data analysis to come to some net new results. And in that world, it's actually completely fine if your code looks like garbage, if you can't...it's just like, ""Poke around until you find something interesting and then wave your hand and be like, 'Hey, does anyone else find this interesting?'"" Whereas analytics engineering is this thoughtful effort to slowly construct reality for the business. My favorite example of this is actually...this was a consulting project. I was working for a full-stack grocery delivery company. And I had to help them calculate the Cost of Goods Sold for, like, an individual batch of green onions. It turned out to be an incredibly challenging problem. And in many ways, deeply unsexy. But it was so fun to me to like...now, every single time a picker picked a thing of green onions out, we knew exactly how much cost to allocate to that. Lukas: Yeah, totally. And I think it's funny how CFOs come from a totally different lens from ML, where they really want things to be precise and accurate and consistent and traceable. I mean these Cost of Goods Sold calculations always end up being...to get it that precise — which I understand why Finance wants that — is often in deep tension with the sort of exploratory data analysis that's also important. Well, I had a question for you that I really wanted to ask. Which is, both of us run companies that are kind of hard to explain to our aunt or uncle, right? Kind of behind the scenes in helping a lot of things happen. But I think one thing that both of us share is we really are passionate about the impact on the world. And we're kind of in this maybe, you know, more for the impact than the financial gain. I don't want to put words in your mouth, but that's my sense of you. I'm curious how you think about the impact of the work that you do or how you articulate it to prospective employees or the world. Tristan: Yes, I agree with you and I'm like ""Okay, let's go there."" But actually no one asked me this question. From a commercial perspective, our mission is to help data analysts and help them curate and disseminate knowledge inside of organizations. But if I broaden the lens and think societally — and there's a lot of tech where we like to talk about making a dent in the universe. I think that's overplayed sometimes, I try to be a little more humble than thinking that we are going to somehow impact the trajectory of the universe — but when I frame it like that to myself, I am deeply concerned with our epistemic reality as a world today. 
We don't need to go too deeply into politics, but there's been a lot of interesting conversation happening at the national or international...this is not just associated with the United States, but where people disagree on basic realities of what is true. And because of that, we actually have a hard time having conversations or having productive debates. Maybe some of that's in good faith, maybe some of it's not in good faith. But whatever..the thing that dbt does is try to get to a ground truth that everybody inside of an organization can agree on, so we can at least have productive disagreement. I don't know that there's some way to magically organize all structured information in the entire...okay, maybe that's beyond what we will ever get to as a company. But it does motivate me to think that the world that we are working on, figuring out the epistemic reality inside of organizations is actually a big problem for the entire world right now. Lukas: Interesting, great answer. Tristan: Is that what you were expecting? Lukas: No, not at all. It's a really interesting answer. I'm just contemplating it. I think it's a great way of looking at dbt. I always don't want to be the caricature of a startup CEO saying, ""We're changing the world with better MLOps,"" but at the same time we are changing the world with better MLOps, and I do feel proud of it myself. I don't want to come like a blowhard, but I also do feel really proud of the work that we do. I think it makes a small dent, you know, a small dent in the universe. And I don't want to be falsely humble either, when it feels good to help out all these customers working on really, really exciting things. But I think you have such a specific, interesting answer. That's such a great way of looking at what dbt does. Tristan: Talking about the customers building cool stuff, there's this funny conversation going on inside of our community these days, where a lot of folks who used to be practitioners have gone over to the founder side. They've gone over to the dark side. So when it used to be all of these practitioner-to-practitioner conversations, now it's a bunch of tool vendors hyping their own stuff. I'm a little bit jealous of...I would love to actually go back to the other side of the fence. Maybe at some point we'll get the opportunity to rejoin the people who are actually using the shovels as opposed to making the shovels. Lukas: I guess here's another question that I think about a lot. How do you stay current without working on this stuff? For both of us, I imagine it's important to keep doing a little bit of the task. It's very hard for me to learn about machine learning in theory without practicing it. I'm always really trying to carve out time to train new models and try out new things that are coming out. But, you know, the urgent needs of running a fast-growing company encroach aggressively in that time. How do you think about that? Tristan: Yes. This is something that concerns me a lot. I think that I might be in a slightly easier position than you. You can summarize a lot of the characteristics of our world based on the evolution of the data platforms that all this stuff runs on. You can summarize that in like Price Per Performance and these kinds of characteristics. Fundamentally, SQL basically does what it has done for fricking 40 years or whatever. And then the tooling on top of it. There's areas in our ecosystem that have a lot of movement: data observability, data quality, cataloging. 
These kinds of things are very fast-moving right now. And then maybe there's another, like a next wave of data analysis products that are coming out. I end up staying on top of stuff by curating a newsletter. I have - for six and a half years now - published a newsletter called...that now is called the Analytics Engineering Roundup. It goes out every week. I write half of the episodes or the issues. It is this really great accountability tool to make sure that I actually have something new to say every two weeks, because otherwise it's incredibly easy if no one's...when 15,000 people are going to read the thing that you just put out there, you feel a lot of pressure to say something correct and novel and interesting. But otherwise it's very easy to not invest that time. Lukas: Totally. It's funny, I actually use those external forcing functions too. They're so effective and I always get really nervous before I have to put out something like that. Or I sometimes set up talks with topics that- Tristan: -you don't know the answer to yet. Lukas: -I don't fully know about yet, I need to force myself to figure it out. Sorry for those of you that have watched those talks and thought I didn't look like I knew what I was talking about. Tristan: Well, sometimes it turns out that they go great, right? Lukas: Totally. Tristan: And then every once in a while you're like, ""Ah, that wasn't perfect."" Lukas: I feel like sometimes if I give the same talk too much, I find myself getting bored in the middle of the talk. And then I feel so sorry for the audience, because I figure if I'm getting bored, the audience must be bored out of their minds. Tristan: I have a tremendous amount of respect for professors, for teachers, who keep the energy level up delivering the same stuff over and over again. Lukas: Totally. Okay. Well, tell me about starting dbt. I'm sure everyone asks you that, but it's such an interesting question. I'm curious what you were thinking when you started it. Was it just a rocket ship from the beginning? Or was there kind of a moment where something changed, and this started to really build traction? Tristan: The origin story of dbt is that I was burnt out from venture-funded startups. I'd worked at three of them. I think that, as a community, venture-backed startups are getting a little bit better about work-life balance. But inconsistently so. Certainly back in 2015, that was not the case at all. I'd been working for 11, 12 hours a day for like 7 years. It was like, ""Okay. I'm done with that,"" and I really want to go back to data. I had started my career in data, and then I'd gone to different...I wanted to get back to actually having a pure data job. And so I was like, ""How do I do this?"" And how do I do it from Philadelphia? Because I'm married and my wife has a cool job and she's like, ""We're not moving."" I decided to start a one-person consulting shop and I was just going to help companies implement what became known as the modern data stack. So, a data warehouse, a data ingestion tool, a BI tool. And I was going to help them do their internal analytics. The thing that was clearly missing to me was data transformation, which was a part of how the stuff had been done in the past, but there wasn't a modern data stack solution. I got my friend and coworker Drew Banin to help me build the early versions of dbt. Not that many hours were put into the initial versions of dbt; dbt is not that complicated. Drew joined and we started using it on consulting projects.
It was really our consulting clients who got exposure to dbt. They said, ""Hey, I want to start using that tool."" And so they would train their internal people. The big locus of where the community came from was back in 2016, Casper got turned on to dbt and they were kind of a big deal in the New York tech scene at the time. They told their friends and so Kickstarter and et cetera. It was a New York tech thing. If you look at the graph — we do anonymous event tracking inside the open source product — if you look at the graph of the number of different organizations using dbt over time, that graph has grown at 10% every single month for five and a half years now. It does feel like- Lukas: -can I interject? I'm so jealous, I'm so jealous. That's amazing. All right, go on. Tristan: At the beginning, we didn't even focus on it because we didn't have a way to make money off of that. It was just like, ""Whatever, that's cool that the community is growing."" And then we got to a point where we grew from 300 to 1000 companies using it over the course of a year. That's when the Fortune 500 companies started calling us and were like, ""Hey, we'd like to buy stuff from you."" And we're like, ""We don't have anything to sell you."" That was when we kind of changed directions and became more of a software company. But there was no single point where it all came together. It was just this...people underestimate the power of exponentials over long periods of time. Lukas: Totally. I guess another funny thing about dbt is that it seems so conceptually simple, doesn't it? It's funny...I feel like these are mean questions. I was asking the Spark founder ""What makes Spark complicated?"" and ""What makes Ray complicated?"" All these things, at their core, seem simple. What makes dbt hard to build? Tristan: The simplicity is...I don't want to take credit for that, but I think that one of our main driving product goals is to be simple. Lukas: Who else would take credit for that? Can't you take credit for that? Tristan: Mitchell Hashimoto should take credit for that, because it's a straight-up copy of Terraform. The user interface paradigm is...my other co-founder, Connor, was an infrastructure engineer at our last company together. I was telling him about this need that I had. And he said, ""Have you ever seen Terraform?"" This was back in 2016, so Terraform is still kind of new and cool. It was like, ""Let me show you this thing."" He showed me the HCL behind it, and then he did a tf-apply and I was just like, ""Holy shit. That's really freaking cool."" Once you've seen Terraform and you've used it, you're just like, ""Well, obviously that's how I'm going to do that moving forwards."" That was the product goal of dbt at the outset. It was Terraform for analytics. On some level, what dbt does is it takes SELECT statements that are inside of .sql files on your machine and it wraps them in CREATE VIEW or CREATE TABLE as SELECT statements. And then it does some DAG processing with NetworkX and Python. On some level, that is actually quite simple. The hard parts come in when — there's a lot, and I'm not the person who built it, so you're going to hear it passed through a less technical person's mouth — Jinja is really meant to be used as a web templating language. It's meant to process one HTML page at a time, like request/response. And in that context, it works quite performantly and all is well. 
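To make the mechanism Tristan describes concrete, here is a toy Python sketch of the idea: SELECT statements live in model files, a Jinja ref() call records dependencies, NetworkX orders the resulting DAG, and each rendered SELECT gets wrapped in a CREATE TABLE ... AS statement. This is only an illustration under those assumptions, not dbt's actual implementation; the model names and SQL are invented.

```python
# Toy illustration (not dbt itself): render {{ ref(...) }} with Jinja, collect the
# references into a NetworkX DAG, then wrap each SELECT in CREATE TABLE ... AS and
# emit the statements in dependency order.
import networkx as nx
from jinja2 import Template

models = {
    'stg_users': 'SELECT id, email FROM raw.users',
    'students': "SELECT * FROM {{ ref('stg_users') }} WHERE email LIKE '%.edu'",
}

def compile_models(models, schema='analytics'):
    graph, rendered = nx.DiGraph(), {}
    for name, sql in models.items():
        graph.add_node(name)
        deps = []

        def ref(other, _deps=deps):          # ref() records an edge and returns a table name
            _deps.append(other)
            return f'{schema}.{other}'

        rendered[name] = Template(sql).render(ref=ref)
        for dep in deps:
            graph.add_edge(dep, name)        # dep must be built before name
    return [f'CREATE TABLE {schema}.{name} AS {rendered[name]}'
            for name in nx.topological_sort(graph)]

for statement in compile_models(models):
    print(statement)
```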
In dbt, because all of your pipelines together make a DAG, what dbt has to do is it has to read all of them at startup time in order to understand the shape of your entire DAG, so it can know what work it needs to do. If you have 50 of these, that's not a problem. But we have users who have thousands of these. And it turns out that it's quite challenging to read thousands of files from disk and operate on them in a way that feels interactive to a user on the command line. I think the team last year was four people. We spent four person-years of engineering time last year almost exclusively on performance. So that's an answer. There's many answers to...once you go deeper and deeper down this hole, and I'm sure you've experienced this too. Sometimes the decisions that you make early on in the process of building something, you come back to later and you're like, ""Wow. Gosh, I didn't realize what a bad idea that was going to be."" Yeah. It's a constant iteration cycle. Lukas: How about documentation and API names and things like that? How do you feel about how well you've done on that? That's always something that I reflect on with Weights & Biases. Tristan: Oh, we're not great at that today. Our...your APIs, your whole product is commercial product, right? You don't have open source surface area? Lukas: We do, actually. We have a client that's open source, and then the APIs are...anyone can call the APIs and pull stuff out. But yeah, the client is open source. It could go anywhere. Tristan: So, we have this funny thing where we have two different types of users. We have users who tend to be less technical. There are people like me, who their primary language is SQL and maybe some scripting and stuff like that. And then we have contributors, and that group is much smaller. They tend to be data engineers and not data analysts. We have historically prioritized the needs of users over the needs of contributors. And that has meant that we have — whether it's in the open source context or in our cloud product — we've historically under-invested in clean APIs. The open source product really exposes itself as the CLI. If you try to get in there via Python and call stuff directly you can, but we don't make any guarantees about the stability of those APIs. So we need to improve there. As we mature as a commercial business, we're increasingly taking the needs of data engineers seriously too, because dbt is increasingly this mature piece of data infrastructure inside of the companies that use it. Documentation and API design are very front-and-center in our world today. Lukas: Is it a command and control style management to keep the names consistent and things like that? How do you source community ideas and yet keep predictable names and things like that? Tristan: I don't know that we've dealt with the name thing as much, but I will say that we're not especially good at getting groundbreaking new contributions from the community. We have a real design ethos, the product is designed in a certain way. And it can be challenging for folks who aren't a part of all of these conversations about this to do big new things. I will say that we have done a better job over time of carving off spaces of the product that are much safer to get external contribution on. So we now support a dozen or so database adapters. And increasingly it is the vendors for those database adapters that maintain their own adapter. That's a very well-defined surface area. 
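As a side note on the well-defined surface area Tristan mentions for database adapters: the shape of such an interface can be pictured as a small abstract class that each vendor implements. A rough, hypothetical sketch, not dbt's real adapter API.

```python
# Illustrative only: a minimal adapter interface so vendor-specific code stays behind
# a small, stable surface area.
from abc import ABC, abstractmethod

class WarehouseAdapter(ABC):
    @abstractmethod
    def connect(self, credentials: dict) -> None: ...

    @abstractmethod
    def execute(self, sql: str) -> list[tuple]: ...

    @abstractmethod
    def quote(self, identifier: str) -> str: ...

class ExampleWarehouseAdapter(WarehouseAdapter):
    '''A toy adapter that only logs what it would run.'''
    def connect(self, credentials: dict) -> None:
        self.dsn = credentials.get('dsn', 'example://localhost')

    def execute(self, sql: str) -> list[tuple]:
        print(f'would run on {self.dsn}: {sql}')   # a real adapter would call a database driver here
        return []

    def quote(self, identifier: str) -> str:
        return '"' + identifier + '"'              # most warehouses quote identifiers with double quotes
```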
I've never run an Apache project, but I have a lot of empathy for people who are trying to run open source projects without a benevolent dictator for life. It's legitimately very hard to work through these kinds of things purely in GitHub issues or things like that. Lukas: Totally. And you wonder if the outcome of that kind of consensus building might not be as good as if somebody is just appointed, like ""You make the call and drive forward."" I'm not saying this is necessarily better, but it's something that we think about at- Tristan: -it certainly takes more work. Lukas: Yeah, for sure. I want to make sure I ask you about your community because you're so well-known for the quality of your community. Can you talk about what you do in community building and why they're even...I feel like a priori, you might not even expect there to be such a vibrant community around a tool like dbt. How did that happen? Tristan: I think it is very interesting. I want to have some epistemic humility in terms of...I don't know. I have my own guesses as to why this happened, but community is an emergent phenomenon. I think you could ask different people and different people would have different stories. Here's my belief. I think that there have been multiple decades of data people being undervalued. That the tools that are built for them underestimate their capabilities. And tools that lock them in. So you're less willing to give back to a company that feels like maybe it doesn't have your interests at heart. For the first time, I think we said to data people that, ""We believe you're very capable. We think that there's this new way that you can work, and here's the little seed of a tool that will help you do that."" And I think that people...I think all communities are really communities of identity. They have to feel seen and recognized. That's what creates loyalty. I think that that's why data people — especially early on, but still today — feel a deep affinity for the dbt community. Because it's the place that they feel like they're really seen and they're not underestimated. Lukas: Interesting. That seems very plausible to me. I don't think ML engineers are maybe as historically disrespected in organizations — maybe they're kind of put on a pedestal — but I think Weights & Biases was one of the first companies with the point of view of, ""Hey, we're going to really serve this specific group."" Where I think most of the earlier MLOps tools came with a more top-down mindset of, ""We're going to sell into CIOs and sell high in an organization."" And I think whoever you sell to really ends up controlling your product direction, is what I've seen. Tristan: Totally. Okay. We do top-down sales at this point too, but it will always be a complement to bottoms-up, community-led motion. It feels very surprising to me that - and maybe it's just because I don't understand the full ecosystem as well as I'd like to — but it feels very surprising to me that not all companies today in data are started with bottoms-up motion. It's so much more fun to build a business like this, right? I want to build a good product for CIOs. I want them to value what we do. But I want to spend my time talking to people that do the work. It's just more fun. Lukas: I feel exactly the same way. Tristan: Why do you think there's still so many companies that build tools that intend to be top-down? Lukas: Well, I think that building a company that sells lower in an organization first is a slower road. Tristan: Yeah.
Lukas: People have less budget. And so in a smaller market where you need to do bigger deals, it might be necessary to sell higher in an organization. My first company, CrowdFlower, intuitively started off with a bottom-up sale, but towards the end really ended up serving folks higher in the organization, just because I think the ML market at the time was smaller. So you couldn't do it. I think me and you are much more of the temperament to sell to the people that are actually doing the work as I think of it, but... Tristan: Maybe it is a market maturity thing. I think that there...that the places in our space that are generally a little bit more tops-down are things like governance and cataloging and things that you need a lot of standardization. Maybe there's a compliance buyer, things like that. Lukas: Do you think of Databricks and Snowflake as a top-down or bottom-up sale? Is it obvious to what they are? You can kind of get started off the website, but I sort of view them as doing more of a top-down sale from my perspective. But you would know better than me. Tristan: It's an interesting question. When I think about this, I think about...when a sales person engages at a company, do they have to educate that buyer on what their thing does in the first place? And for Snowflake and Databricks, I think by and large, their buyers already know who they are. The job of the salesperson in that context becomes partnering to make sure that...there's like a million hurdles that will prevent you from effectively using Databricks, or any data platform. So the sales person almost has to just project manage their way through both the consensus-building process and the actual implementation process. But I think that sometimes when you go to buy a data governance tool, it's like, ""Well, I don't know. What governance tools exist? Well, let's research them."" I would much rather come in...when we talk to data leaders, they're like, ""Yeah, we know dbt, we heard you on the A16z podcast."" They probably already have some people who have tooled around with it internally. It's such a more fun conversation to have. Lukas: Totally, totally. Well, it's hard to do that. You've made a product that many, many, many people use. Growing 10% every month, this puts you in a rare category of growth. Do you have thoughts around where the data world is going? What parts of the stack are likely to change in the future? Tristan: Gosh, that is a very big question. I just spoke to...I wrote a blog post at the end of 2020 that made five predictions. By and large, I think that those stand up pretty well, but I think there's a new set of things that probably needs to be written. I just talked to a company that is building a layer that allows you to turn your data warehouse into a transactional data store. That is very interesting, because if you think about all these SaaS products that have been built over the past 15 years, each of them has their own separate data store. You have all this data engineering to do to make sure that the right data is in Salesforce. And then that the Salesforce data comes back over into Zendesk. It gets a little silly. 
You could imagine that, ""Well, we've centralized all of our organizational data with these data pipelines - that were initially built for analytics - and the data warehouses themselves are primarily built for analytics too, but what if we could have another data store that sat on top of it that had more transactional capabilities?"" And would allow you to have lots of queries per second and good insert and update times. Not just that capability, but the idea that the data warehouse will stop being just for analytical use cases and be for operational use cases, I think is a very interesting thread to pull on. I have no insider knowledge here whatsoever, but my guess is that Snowflake and Databricks would love to invest in technology to...if you look from the outside, Snowflake has changed its messaging over the years from being a data warehouse to a data platform. Now it's a data cloud. The game in compute is you want to handle more and more and more and more workloads. I think there's a lot of reasons that we as data professionals, should like that. Because it means that we wouldn't just be doing things in service of analytics. We could actually be a part of the product development organization side of companies too. Lukas: Wouldn't latency need to come down to do that? You're talking about being literally something that the product actually queries in production. Tristan: Totally. So, imagine...I think there's different ways to do this and I've heard different proposals, but imagine that there's a caching layer on top of the warehouse. It's using replication to get a very consistent state of the world, maybe there's a small lag between the data warehouse information. You could imagine latency that actually was acceptable for a production application use case. Lukas: I see. Interesting, interesting. Tristan: There's VCs that are all over...Martin Casado, who's on our board, is very bullish on this trend. Tom Tunguz was writing about this two years ago. I've always wondered like, ""Okay, but, but the data warehouses don't actually...they can't service that type of query pattern today."" But maybe if you just like wave a magic wand and you're like, ""Somebody's going to fix that,"" then you could see some interesting things happen. Lukas: Interesting. It's funny. One space that I think is kind of unsexy to VCs, but still seems surprisingly broken to me is BI tools. I guess that's part of the stack, but it's just funny. I think so much money has gone into it. Every company uses it. There's like clearly a market there. But I feel like I haven't seen a lot of new things happening and it's still quite a frustrating experience as a CEO. Tristan: I do some very, very small-scale angel investing and that is the area where I'm most interested in. I agree that many of the BI or analytics layer products that most companies use today were started roughly 10-ish years ago. Which in the world that we are operating in is kind of a long time. Lukas: Totally. Tristan: That doesn't necessarily mean that there's anything wrong with them, but I do want to see new takes arise. I think that that's starting to happen. I think that sometimes it is because in the same way that Redshift kicking off the wave of the cloud data warehouse changed the priors for ""What has to be true for me to make an application that looks like this?"", dbt changes those priors again. If you're building a BI tool, you can just assume that somebody is going to have a dbt project. 
You can actually plug into the graph and you can know a bunch of information about somebody's data before they've done literally anything in your product. Lukas: Interesting. Well, we've talked a lot about data, but this is...I mean, ML is so closely related to data. I'm curious, is ML relevant to your company at all? Do you have any people working on ML internally? Do you think about ML when you think about what dbt should do? Tristan: There are things that we care about — from an ML perspective — that we have not yet gotten to. They are frequently in the realm of developer experience. We have an IDE — a browser-based IDE — that we sell to companies, and there's a lot that you can do in that context to reduce the time to get from Point A to Point B. We have access to a lot of exhaust that comes out of the millions of dbt jobs that we process. And it would be great to use some of that to predict good and not-good patterns for the way that you've built your DAG, written your code. None of these are things that we've...we operate solely in the land today of building developer tooling using very traditional approaches. But this stuff is not so far around the corner and I'm excited about it. Lukas: Cool. One more question before we get to the last two. How do you feel about SQL? It's been such...I feel like of all the computer languages, it's survived the best. I feel like everyone knows SQL, everyone uses SQL. Something must be really good about it, I think. Do you think it became a standard early and has just sort of stuck around as a standard despite its flaws, or do you feel like there's like some brilliance in it that makes it work? And do you wish that it would be replaced by something more modern? Tristan: Standards are really interesting. I don't know that there's a technical answer as to why TCP/IP and HTTP are like the founding protocols of the internet. I think that they worked well enough and people consolidated around them and then you have an ecosystem and there's- Lukas: But wait, but wait. Languages don't usually work like that, right? I feel like the languages that I learned in school, even now they're not mostly...I learned Perl, that was the thing to use. And you don't see that much anymore. Tristan: That's a great point, but I think that what happens with these protocols, with TCP/IP and HTTP is that they get baked into products. They get baked into the Apache web server, they get baked into...et cetera. And that has network effects because when all the other vendors support this thing then, ""Well, we got to support it too."" And then everybody just kind of agrees, ""Okay, this is good enough."" With a language, with Python, you run it yourself. You don't need it to be executed anywhere else. Every individual engineer or engineering team can kind of choose Python or Go or TypeScript or whatever. And they get to make that decision without any network effects being involved at all. But SQL is more like HTTP than it is like Go because you, as the person choosing to write it, are not controlling the execution environment. You buy a database and there's only certain number of databases and they all use SQL. Well, maybe not all of them. But, by and large, most of them, historically. 
So not only are there these network effects around, ""Because the vendors support it, then I have to learn it,"" but then there's the return network effects where like, ""Well, because everyone knows SQL, I am also going to build a product built around that."" Snowflake could have said...Snowflake was a brand new database. In 2012, they could have said, ""We're going to invent our own language."" But that doesn't make any sense because Tableau already works with SQL and everything already works with SQL. Lukas: I guess it's funny. Java or the JVM has some of that. And then you see stuff like Scala getting written on top of that, or compiling down to that, but yet everything that compiles down to SQL is just enraging. I feel like every time I've used a higher level on top of SQL, like all the different versions, I feel like I've tried them and something about it- Tristan: -like Active Record or an ORM or something? Lukas: Like every ORM is just...at first it feels good. And then you just like tear out your hair- Tristan: -you get into the edge cases and it's terrible. Lukas: Yeah. Why hasn't someone built a higher-level construct on top? Tristan: I totally agree with that. We didn't talk about this pre-taping, so I'm so excited to be talking about this. That is generally how standards progress. There's this base thing, and then people are like, ""Okay, that's good enough for what it does."" And then they're like, ""Well, let's build a higher level of abstraction"" and it will solve some of the...this is like JavaScript and et cetera. We talk about this internally as ""Who's going to build the React for SQL?"" And I'm very interested in that question. I believe that will happen over the next five years. I think that there's too much money floating around in incentive to want...the way that dbt works, it's very similar to Ruby on Rails back in the day, with .erb files. There's templating. But we didn't build React. And I think that either we will, or somebody will. If somebody builds it and it's not us, then I'm very happy to just have it be another choice of language that you can put into your dbt DAGs. I agree. I think that we've — using templates — we've made a lot of progress in what you can idiomatically express in SQL, but it's still not as pleasant of an experience as just writing other languages. Lukas: Are you working on this? Tristan: No, not today. The one person who is working on this in public is Lloyd Tabb, the founder of Looker. This has been Lloyd's passion project for a little while. It's called Malloy. You can find it on their public GitHub. It's very interesting. It's not exactly how I would build it, but also I recently got a demo from him and there's some real magical capabilities there that I had never even thought to want out of my SQL-like language. I don't know. I would like this as much as you or anybody else. Lukas: Very cool. Well, we always end with two questions and I want to modify them for you, I guess. We usually ask what's an underrated topic in ML, but maybe I'll ask you what's an underrated topic in data. I mean, we've covered some of them, but what- Tristan: -can I answer in ML? Lukas: Oh, sure, please. Yes. Absolutely. Tristan: I think that ML has a persona problem and that there's been some reckoning with this. There's some ""make ML more accessible"" tooling. In general, I don't feel like that has been spot on. It's clear that the tools for the big kids are really where everybody's focused on today.
There are some...there's a company called Continual, there's a couple other companies in the space of trying to bring ML to the types of workflows that people in my world use. And I would desperately love that. I'm very familiar with what is going on inside of an ML model, but it is also clearly not exposed to me in a way that is idiomatic for me to participate in this workflow. So I'm excited about that gap being bridged. Lukas: Interesting. Like making it simpler to just make an ML model from a set of data? Tristan: Yeah. What Continual is doing is they're actually plugging into the metadata inside of dbt. And you can actually add some additional metadata properties that declare certain fields inside of a dbt model as being your features and the success criteria. And then Continual kind of plugs in with its own AutoML process and trains a model and dumps it back into your data warehouse for you. Lukas: Wow. Do you actually use that? Or what- Tristan: -I don't, they're super early. I would like to get my hands on it and use it myself. They have customers though. Lukas: Cool, awesome. Continual. I'll check them out. The final question is usually ""What's hard about getting ML working in production?"" and people usually answer that question...we should actually do a graph of this, but I think that the most common answer is usually the data pipeline feeding into the ML model. Within that, when you see companies trying to set up a working data pipeline, what's the long pole? What's the place where people usually get stuck? Tristan: Debugging data pipelines is very hard. It's not very hard for people who live in this world all the time, every day, but it's still effort and time-intensive for us. I think that the whole world of observability, reliability, all of this stuff...my answer to ML in production is kind of...I don't totally understand why...so, dbt runs on Spark, dbt runs on Databricks. Both Spark and Databricks have SQL run times, so we can plug directly into them. And yet, that is not where most of our users are today. There's fundamentally not that much difference between doing feature engineering and doing what we would call data transformation. You're doing the same damn stuff. I think that the answer to why these two groups of humans do not consolidate or collaborate more effectively is, again, the same reason that it goes in reverse. Most ML people, I think, don't think in SQL. I'm excited because more and more of these data platforms are exposing Python remotely. dbt does not do any local execution at all. We ship SQL to a data warehouse, which executes it. And the funny thing is that that type of interactive work doesn't exist in the Python ecosystem as much. Mostly it's like, you're on a machine. You're running it there. Databricks has a notebook API that we can plug into to actually run PySpark code. Snowflake has a new thing called Snowpark where you can run execution of Python. I think that we are going to be working from our end to close this language gap that exists in practitioners today. Lukas: Cool. Awesome. Well, thanks for your time. This was super fun and I learned a lot, so I have a feeling our audience will also learn a lot. Thanks. Tristan: Thank you. It's been a lot of fun. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. 
So check it out.",8740 +Johannes Otterbach — Unlocking ML for Traditional Companies,https://www.youtube.com/watch?v=aGq4zFT2tuo,2694,2022-05-12,"Johannes: If you take those big models, you'll run into a problem. You need already compute power. You need infrastructure. You need MLOps. You need a whole department to actually make use of those models. Not many people have that, especially those companies that it's most useful for. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Today, I'm talking to Johannes Otterbach. He was originally a quantum physicist and then went into machine learning and he's currently VP of Machine Learning at Merantix Momentum. Merantix is a really interesting company that develops all sorts of machine learning applications for customers, small and large, and then really deploys them into these real-world systems and hands them off. We really get into real-world applications of ML in factories and other places, the tooling required to make machine learning work, and also how to do smooth handoffs so that customers are actually successful in that transition. This is a really interesting conversation and I hope you enjoy it. Lukas: My first question for you is looking at your resume, you're one in a long line of people that kind of moved from physics into machine learning. I'd love to hear what that journey was like, you know, studying quantum physics. And I think you worked on a little bit of quantum engineering or quantum computing. And then now you do machine learning. How did that happen? Johannes: That's a great question. I think initially I was super excited about physics because physics, I just saw as something to understand the world. I'm really excited about understanding how things work, taking things apart to put it back together. That's always drawn me to physics rather than engineering. I was on track to just do a career in physics and then AlexNet came out and the ImageNet challenge happened. Like, ""Holy crap. There is something really cool happening."" It's always funny to tell people I did my PhD before ImageNet was a thing, because that makes me really old. But it was kind of an exciting time. And so when I heard that, I was like, ""Well, I want to reconsider my career as a physicist anyway,"" at that point, and looked into what this AlexNet was about and the ImageNet challenge. This covered this whole field of data science and big data that was starting off at that time. That's a very natural transition for a physicist because we are good at statistics, we are good at modeling, we like math. And then I fell in love with this big data, data science. And since then, I've been continuously driving at understanding the language of data. ML is just like an expression of that language, and that's why I fell in love with it. And now I'm here. Lukas: You did do some work in quantum computing, is that right? Do you think that quantum computing has anything to apply to ML? Or do you think ML has anything to apply to quantum computing? How do you think about that? Johannes: I think it actually...it's mutually beneficial and I see there will be a convergence of those two fields in the near future. There are four different quadrants that we can talk about. We have classical and quantum, in terms of engineering and in terms of data. You have quantum data and classical data and you have quantum algorithms and classical algorithms. 
You can actually start to think in those four quadrants. I think that right now we see that a lot of effort is being put into using quantum algorithms to classical data. That I think is actually potentially the wrong way to think about it. We should always think about like quantum algorithms for quantum data and maybe classical algorithms for classical data. These crossfields are a little bit more complicated to solve. I think cross-fertilization is going to be happening. Lukas: What is quantum data? Johannes: Quantum data is essentially data that comes out of quantum states. I don't know how deep you are into quantum computing, but typically in quantum computing we don't talk about definite outcomes in a way, but we're describing systems by wave functions, which are — naively speaking — the square root of probabilities. Quote unquote. Don't take this too seriously. What you get with this is essentially expressions through quantum data, which just has a phase and an amplitude. If you start measuring this, you get a lot of complex numbers, you get various different types of phenomena. And those data typically take an exponential time to read out into classical states. When you have a quantum state and you want to completely express that quantum state as classical data, you get an exponential overhead in storage. Lukas: What's a situation in the real world where I would have quantum data? I can imagine how these quantum computers produce that, but when would I be collecting quantum data? Johannes: When you actually deal with quantum systems. If you want to start to understand molecules, for example. Very deep interactions of molecular properties, they are ruled by quantum rules. If you want to simulate molecules, you would rather want to do it in quantum space than a classical space. That's really the way to go. That's why modern or early stage today's quantum computers are more simulators of other quantum systems. You use these computers to simulate quantum systems that you want to study in a very, very controlled fashion. And then you deal with quantum data at that point. Lukas: Are we actually able to simulate quantum systems in a useful way? Because, you know, I have experience with classical mechanics systems and the simulations seem to break down very quickly. I can only imagine that the quantum simulations are much harder and probably harder to make accurate. Johannes: We are getting really good results. A lot of quantum experimental physics is essentially doing that. We have toy models that we use in order to validate our mathematical theories. A good example is a field that I worked in back in the past, which is quantum optics, where we have a lot of laser fields and single atoms. And we start to put them together in a certain fashion in these laser fields so that we can simulate materials that we really have a hard time understanding. Like for example, high temperature superconductivity. We have certain types of mathematical models — statistical models — that we think about like how these things can come across or can come about. And then in order to study the effects of these models, we use a very clean system that we have a high level of control for, and try to simulate those mathematical models, and see if those models then give rise to these phenomena that we see, for example, in these materials that have high temperature superconductivity. 
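Johannes's remark that reading a quantum state out into classical data carries an exponential overhead can be made concrete with the standard state-vector picture; this is a generic textbook sketch, not specific to his work.

```latex
% An n-qubit state is a superposition over all 2^n basis strings; measurement yields
% outcome x with probability |alpha_x|^2, which is why the amplitudes are informally
% called square roots of probabilities.
\[
  \lvert \psi \rangle = \sum_{x \in \{0,1\}^n} \alpha_x \, \lvert x \rangle ,
  \qquad \sum_{x} \lvert \alpha_x \rvert^2 = 1 ,
  \qquad P(x) = \lvert \alpha_x \rvert^2 .
\]
% Storing the state classically requires 2^n complex amplitudes, so n = 50 qubits
% already corresponds to roughly 10^15 numbers.
```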
So we use a much simpler system to simulate a much more complex system in order to probe our understanding of the physical laws here, in this case. Lukas: Are there applications of ML to that? I feel like we've talked on this show to some chemists in different fields, and they've been sort of using ML maybe to approximate these kinds of interactions. Is that an interesting field to you? Johannes: I think that's an interesting field to me. But actually I think I'm much more excited about a completely different avenue of applying ML to quantum systems. If you think about building a quantum computer, you have a lot of different qubits. These are like the atomic units. You have bits in a computer, a classic computer. You have qubits in a quantum computer. To use and address these qubits, we have to very, very meticulously control those qubits in order to really make them do what we want. You cannot just flip a switch from zero to one, but you have to control everything between zero and one. It's a very, very analog computer, like an analog computer to a certain extent. And in order to control these kinds of systems, I think here is where ML comes into play, because you can use, for example, reinforcement learning techniques to do optimal control of these quantum gates in order to facilitate those two-qubit interactions or three-qubit interactions and to get a high-fidelity quantum computer. And I think that might be one of the early applications of ML to quantum systems and to quantum computers. My firm belief is that we probably need machine learning techniques — modern machine learning techniques — in order to scale quantum computers to sizes that are actually useful. Lukas: Interesting. I feel like I've met a number of people in machine learning that kind of feel like they're refugees from quantum computing. Like they felt like it didn't really have a path to real world applications and kind of moved into machine learning. When I saw your resume, I wondered if you were one of those people, but it sounds like you're pretty optimistic about the future of quantum computing. Johannes: Yeah. I think that the question is on which timescale, right? The quantum computer is still very nascent and I feel that quantum computing will go through like the same kind of winters that machine learning went through a while [ago]. When this will happen, I don't know, but we will see these kinds of winters coming. I — in my lifetime — want to see some more impact on a shorter-term time scale. And I think that machine learning is the right path for that. I actually don't think that I shut the door. At some point I want to do a bit of quantum computing again, but maybe take my ML knowledge to quantum systems in order to facilitate some better approaches to do that. But right now, quantum computing is very much at the hardware level and I'm a software guy. Lukas: Cool. Well, tell me about your work at Merantix. Maybe we could start with what Merantix is and what you work on there. Johannes: Yeah, sure. Merantix is a super cool construct actually. We have two separate units: we have Merantix Momentum, and we have Merantix Studio, which is the overarching company. Merantix Studio is actually a venture studio that focuses on deep tech in Berlin.
The idea here is that we have like pre-vetted industry cases, where we've then looked for what we call entrepreneurs-in-residence that want to work on certain critical domains that we deem necessary in order to bring AI into broad adoption outside of just B2C businesses. The venture studio looks at those different use cases then starts to seed an entrepreneur-in-residence, lets them have six months to a year of vetting the use case, and then build up their venture. Merantix Momentum is one of the special ventures because we are actually not an independent venture, we are a 100% subsidiary of Merantix Studio. We are focusing on these use cases where it's not big enough to actually build a venture by itself, but actually still need help for certain domains. We try to focus on use cases of clients that have actual problems to see how can we actually apply ML techniques and ML deployment techniques and MLOps to help those customers in need. Classic example is, for example, visual quality control manufacturers. They have no IT stack, they have no IT system. But they have very hard visual quality control problems. So building a vision classifier based on a convolutional network just offers itself. We build that for them and make sure that it's actually scalable and then also help them put it into production close to the sensors. You can't build an own venture around it, but Merantix Momentum can actually do it. That's what we're here for. And so within that ecosystem- Lukas: -why do you think you can't build a venture around that? I mean, it seems like that'd be pretty useful to a lot of people. Johannes: I think the question is how quick do you gain significant market cap, right? I think eventually you can build a venture around this, but I think the adoption is not big enough yet in order to build your own venture around it. In a way, Merantix Momentum is the venture that can actually do that. Because we're...in that sense we are a professional services department where we go in and say like, ""Hey, you have a problem. You want to have a one-off machine learning model. We can help you get there."" That's what we're doing. So that's kind of the venture around that. But like, you wouldn't build a venture to just go out and do visual quality control for company X, Y, or Z. Lukas: How does it work? I mean, I would think that doing this kind of thing for customers would be very hard to scope, right? Because I feel like one of the challenges of machine learning is you don't really know in advance how well a particular application is going to work. And then downstream from that, it'd probably be hard for customers to estimate how well different levels of quality of the model would really impact their business. How do you make sure that a company is going to be happy at the end of one of these engagements, or do you just view it as sort of an experiment? Johannes: That's a really great question. I think that we are getting some traction on that. The key here is to work earlier with customers to understand their needs. We really have very intense engagements before we start our work to make sure, ""Is the use case actually solvable? How big is the challenge? What kind of data challenges do we meet? Which kind of approaches would we actually take?"" And really take the customer on a journey before we really say like, ""Now we start engaging."" The way that we approach this is a staged approach where we have more individual workshops, which we call the AI Hub. 
Which is a pre-study to an actual work engagement, implementation engagement, so that the customer understands what can be achieved with which data, with which kind of effort. And then we start the implementation work. When implementation work comes, of course, it's a professional services. There's always a little bit of security and risk, but we already mitigated the risk significantly. Often it comes out that some problems are not solvable, and then we go to a different type of model. Which I'm actually working on. Lukas: What type of model is that? You work on unsolvable problems? Is that what I just heard you say? Johannes: Not unsolvable problems, but problems that you cannot just do in a client engagement, right? There's a different funding strategy — that also exists in the US to a certain extent, but much more so in Germany, in Europe — which is publicly funded research projects. The German state, or the federal government, is interested in solving certain types of problems that are industry-spanning, but they're too hard for just a single company to just work on it because you have to bring many, many different domain experts together. So they fund consortial research, which is typically like 4 to 10 partners where you have application partners that bring their challenge problems and datasets with them. Then you have academic partners that bring in academic state-of-the-art research facilities. And then you also have professional services company like us who really understand deployment models, deep tech industry applications, ""How do we make machine learning models robust?"". And you engage in translational transfer research to use the academic results to apply to industry problems. Once you solve that, then you have enough data to actually then bring it to a client engagement in a B2B relationship. Lukas: Can you talk about some of the things you're working on, specifically? Johannes: Specifically? Yeah, we have a bunch of research projects that are going on with big manufacturers in automotives in Germany. We just are about to finish a project on self-driving cars, autonomous vehicles. Very classic use case for Germany, I would say. Here, the idea really is that car manufacturers do not really understand all the details that are involved in building a, for example, segmentation map for optical flow application. But they are very, very good in understanding functional safety regards. And so really bringing those two domains together of saying like, ""We need self-driving cars, autonomous vehicles, but we don't know how to build the segmentation models. We need the domain expertise,"" and we say, ""We know how to build those segmentation models, but we don't know, actually, what are the safety critical features?"" How do we bring those together? That was a research project that we worked at. Lukas: Oh, that's cool. So you're doing segmentation on vision, basically, from vehicles? Johannes: So there's...computer vision is one of them. We were investigating synthetic datasets, where you have essentially a rendered dataset in order to pre-train those models. Optical flow detection, bounding box detection, person detection. These are some classic models. We also have other research projects that are much more going into optimization problems, where you need to understand how manufacturing pipelines actually look like. Cool example — I unfortunately cannot name the company name — but like imagine you have a critical element for building a car seat. There's metal bars. 
And these metal bars, they are funnily enough going through like 50 different manufacturing steps. Sounds crazy, but it's actually true. Those 50 manufacturing steps are distributed over 10 different factories of 5 different just-in-time partners. Lukas: Wow. Can you give me some examples of what these steps might be? It's hard to picture 50 steps in a metal bar. Johannes: The raw metal forming to the raw rod. Then the first processing to bring it to the right rod. Then you do chroming of the rod. Then you start the first bending iteration. Then you rechrome, refinish. Do the second bending, do the next step, and so on until it's in the right shape. There's a lot of these steps. Yeah. I didn't know about that either. It's pretty crazy. What happens now is that in your manufacturing process, a mistake happens at step number 10. You don't notice that mistake until step number 15, when your metal bar is a little bit outside of specifications. Typically what happens is that now we take this whole batch and you put it to a scrap metal and start from scratch. However, the challenge now is like, ""Can you do something in step number 20, maybe, that you can bring that rod back into specifications?"" So that at process step 30, 40, 50, it fits again back into specifications. Now we can imagine this is like a very high-dimensional optimization problem with a very sparse reward signal. Classic optimization problem. That's [the] kind of research projects that we're working at. And now is the question, what kind of techniques in the field of ML can we use and transfer to those kinds of problems? And what kind of data do we actually need for that? Lukas: So what would be the choice here? What would you do differently at, say, step 20 that might make it useful in the end? Johannes: We have to find what are the kind of levers, right? And there is different types of process that maybe you don't heat it up as much, or you over-bend it a little bit into one direction and rebend in the other direction. Maybe you do a refinishing at some point. These are all the levers that we have. We have to explore, ""What is the actual problem?"" And here you start to see that the devil's in the details. What are actually the defects that matter? Like it's a causal inference problem. It's a Bayesian learning problem. We don't know yet because we just started this project. I wish I knew the answer, but then I would have already published something around that. Lukas: Wow, so you're just working on a totally wide range of machine learning applications in the real world. Johannes: That's right. Lukas: You must be building a really interesting set of tools to make this possible. Can you talk about the stuff that you're building that works across all of these different applications? Johannes: Yeah, no, that's a super question because I think that's one of the things that we do extremely well, and we have a lot of fun doing that. Maybe let's start a little bit back because one of the challenge that we have — being in Europe — lots of companies have very, very little trust in cloud deployments here. You have to start with the customer and say like, ""What happens here?"" And one of the things that people are super afraid of is vendor lock-in. So we have to build a tool stack that really is cloud-agnostic. We can deploy on-prem, we can do it on GCP, AWS, Azure, you name it, whatever it is. That's the first prerogative; we need to understand how to build a stack that's completely agnostic of the underlying cloud. 
And so in order to do that, we start of course building stuff on Terraform and Kubernetes. We make extensive use of those systems to automate a lot of deployment tasks. So, infrastructure as code. Now, once you start to go into like all of these files, you're getting fairly quickly lost in them because these configuration files start to become very, very complicated. So we started to build tools to automate how we actually write deployment files. We have an internal tool — which we also funnily enough call dev tool — that essentially is nothing else than building very specifically pre-programmed template files in order to spin up complete deployments automatically. And so we are completely independent of the actual underlying cloud, because we can just spin up the templates of a full deployment cluster. And on top of that, we can then start using all kinds of other tools that we need in these clusters that we deploy. We're typically heavily relying on Dockers. So you build a Docker file that we can then deploy on a pod that we command using Kubernetes or Terraform. For the deployments then we use Seldon. We use a Flyte pipeline to automate complete learning pipelines. CI/CD in that loop is done with Flyte. Right now we still have Cloud Build here, but we're already thinking about how to get that out of the loop. So we're trying to be really, really cloud-agnostic and build a stack ecosystem on these modern ML tools. Lukas: Does this stack that...this stack, I guess you're deploying into a customer's production environment. Does this include training or is it just for running a model for the customer? Johannes: It really depends on what a customer actually wants. We are right now...we're targeting towards MLOps Level 2, I think that's what Google calls it. We are not quite there yet, but so right now we still have a split between manually triggering a retraining that we do internally using our stack in the cloud or on their on-premise system. And then also having a separate manual step to actually deploy it into production. And we're doing both of them. We can actually do the deployment step and the retraining step using all of our infrastructure. And the target really doesn't matter, because we build it cloud-agnostic. We can, for example, do a re-training on our internal cloud, which we mostly use GCP right now for us. But if the customer wants to have the model in their production stack, we train it on our cloud and then move it to their production stack on-prem. Lukas: What have you learned building these tools? I mean, it sounds like you're making the stuff, you're deploying it. There's many, you know, people trying to build these things. What have been the kind of lessons, actually, when these things get deployed into customers' systems? Johannes: That it's really, really hard to do. Lukas: Why is it hard? Cause it's...conceptually it's simple. What actually really makes it hard? Johannes: It's actually not that hard if customers are okay with using cloud deployments. I think what makes it hard is if they're using on-prem in their own stack, because then suddenly the tools are not yet at the point where you can just abstract away every kind of sysadmin. You're always having this touch point between ""How's the hardware actually managed?"" and ""How can you deploy it?"" As soon as you have a Kubernetes cluster installed on-premise, you're probably fine again. But until you get there, you cannot abstract that system away. 
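Merantix's internal dev tool is not public, so here is only a rough sketch of the template-driven idea described above, using Python's standard library; the resource names, fields, and helper function are invented for illustration, and the real tool presumably does far more.

```python
# Minimal sketch of template-driven, cloud-agnostic deployment generation.
# This is an illustrative stand-in for the kind of internal dev tool described
# above, not its real implementation; names and fields are made up.
from string import Template

DEPLOYMENT_TEMPLATE = Template('''\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  selector:
    matchLabels: {app: $name}
  template:
    metadata:
      labels: {app: $name}
    spec:
      containers:
      - name: $name
        image: $image
        resources:
          limits: {cpu: '$cpu', memory: $memory}
''')

def render_deployment(name, image, replicas=1, cpu='1', memory='2Gi'):
    # The same template can be rendered for any target cluster (GCP, AWS,
    # Azure, on-prem), which is what keeps the stack cloud-agnostic.
    return DEPLOYMENT_TEMPLATE.substitute(
        name=name, image=image, replicas=replicas, cpu=cpu, memory=memory)

if __name__ == '__main__':
    print(render_deployment('defect-classifier',
                            'registry.example.com/defect-classifier:1.0'))
```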
And then you're also getting these realities of the business, that you sometimes have to deal with IOT devices. Deploying stuff onto IOT, that really is not there yet. I think the tools are falling short on that end, but I think that it's just a matter of time until we have more tools that are ready for IOT deployments. Lukas: How do you think about monitoring the system in production? I'd imagine these things could be somewhat mission-critical, but I noticed you didn't really mention production monitoring. How do you think about that? Johannes: I think it's very important and we do it. We are not necessarily deploying extremely mission-critical systems right now. So that's what we haven't done yet. I think we're getting there soon. But right now, it's mostly just like measuring uptime and making sure that the stack doesn't fold under load. So it's just the standard production monitoring that is just Grafana load testing, throughput measurements, and these kinds of things. Not necessarily decision-making and auditing trails in that regard. So it's more like a standard site reliability monitoring that can be automated fairly easily using Grafana or any other monitoring tools that you like. Lukas: Got it, got it. I thought you might want to talk about some of the tools that that you've developed, like Squirrel, and Parrot, and Chameleon. Can you describe what these are? Johannes: Yeah, that's really cool. My personal favorite right now is Squirrel, just because we're just about to launch it and then release it out into the world, which is super fascinating. The goal here is that if you take a look into the ecosystem, we are very, very good in building ML models for training on single GPUs. But as soon as anybody encounters for the first time trying to deal with multiple GPUs, you get into big problems. And many frameworks have come across that are actually helping you to distribute a model, but nobody has really thought about, ""How do you distribute the data?"" And there are not many frameworks out there. There is a few things that we have looked at that are trying to solve that, and the ecosystem is getting bigger, but we are now decided we want to go into like a place where we can really make data loading on distributed systems as easy as possible. It doesn't need to be only for deep learning, but it can be for a lot of different things. And on top of that, also build in potential access control levels, right? Like you want to pull that one from this packet, the next one from that packet, the third one from this packet, and make sure that you mix and match this very well. That's what Squirrel's really about, to make data access and data storage and data writing super, super simple. As simple as you can do it by just abstracting away a file system. You can be on a cloud, it can be on local, it can just be pulled from the internet. And it should be easy to integrate in any kind of framework. That's really what we're doing here. Lukas: And your plan is to make this open source? Johannes: The idea is to make this open source. Exactly. Lukas: Cool, cool. I guess, do you have a preference of other open source tooling? Do you guys kind of standardize on your ML framework and things like that? What's your set of tools that you would typically like to use? Johannes: I mean we, of course, are also standardizing as much as we can. You can imagine, having many, many customers who want to have standardized tools. Our standard framework is PyTorch. 
That's what we're doing internally for training these models. We're also getting a lot of PyTorch Lightning as an easy framework. We're also using Hydra — that's developed by Facebook — as an interface and an entry point into those systems. Lukas: Why did you pick PyTorch Lightning? What did you like about that? Johannes: I think the idea here is that it really abstracts away much of what ML training frameworks have to do. You're writing a data loader. You're having an optimizer. You're having a training loop and you have a logger. And typically when you just look at typical GitHub repositories, everybody writes ""for a batch in dataloader do all of these kinds of things"". It's a very repetitive code. Like, just abstract this away, do some software engineering so it's robust, and then you can go with that, right? It's especially important if you're doing production models or you just have to retrain and you need to be stable on that. Software maintenance is, I think, one of the things that is not really in the academic ML community. Which comes as a surprise to me, because the field that is coming out of engineering should value good code quality a little bit more, I feel. So we have to do it ourselves. So, use tools that make maintenance and debugging of machine learning models easier. Frameworks are the way to go for that because you don't want to build it yourself if the community can help you maintain the systems. Lukas: Do you also use PyTorch to run the models in production? I know some people will kind of change the format or do something to the model before it's deployed. Do you just load up the model as serialized from PyTorch or do you do anything special there? Johannes: No, we typically deserialize it from PyTorch directly because right now our motive is to ship Dockers around the world. I think eventually we probably — for certain applications — need to go into a more standardized framework, like ONNX or something like that. That will change the game potentially. But right now we are still using the binary Docker. Lukas: Where do you see gaps in the tooling right now? As someone that likes to make and sell ML tools. What parts of the stack feel mature and what parts feel broken? Johannes: What feels broken to me is that you have to plug many systems into many systems. That feels a little bit sad, because that makes it really hard sometimes to stay abreast of the edge. I don't think that there's anything lacking in the community right now. I more feel like the problem is that too many people are building too many tools instead of just coming together and taking one tool and bringing it to the next level. The thing that then happens is that people try to be different from others instead of making one tool that solves a lot of problems. Counter example where this worked really well is in the data science world, right? You just need two or three libraries in the data science world, which is scikit-learn, numpy, and pandas. And you're set. If you're going into [the] MLOps domain, I don't know how many tools [are] out there. You probably know better than me. It's just...I wonder sometimes why. Lukas: Yeah, that's fair. I mean, I definitely think there's always a moment where there's an explosion of ideas and tools and then things start to standardize for sure. And I think we're still at that explosion stage. Johannes: I think so. Lukas: That's what makes it interesting to be in this world right now. Johannes: I agree. 
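As a concrete illustration of the boilerplate Johannes is describing, here is a minimal PyTorch Lightning sketch; the toy model and random data are made up for illustration. The repetitive for-batch-in-dataloader loop, optimizer handling, and logging live in the framework, while the project only declares the model, the training step, and the optimizer.

```python
# Minimal PyTorch Lightning sketch of the abstraction discussed above.
# The model and data here are toy examples, not a production setup.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self(x), y)
        self.log('train_loss', loss)  # logging is handled by the framework
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == '__main__':
    # Random stand-in data; in practice this would be the customer dataset.
    x = torch.randn(256, 16)
    y = torch.randint(0, 2, (256,))
    loader = DataLoader(TensorDataset(x, y), batch_size=32)
    pl.Trainer(max_epochs=2, logger=False).fit(TinyClassifier(), loader)
```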
I think that there's a lot of abstractions we haven't figured out. Like, for example, deployment to IOT. But I'm super curious about...that I haven't seen much development until recently is, ""How do you deploy models in heterogeneous environments? How do you train on heterogeneous environments?"" I think there is still a lot of ML tooling that needs to get better. Not everybody has a huge data center of homogenous hardware. So how do we deploy models or train models on heterogeneous hardware? Lukas: I guess another question I have is, how do you hand off these models to a customer? You give them a Docker, but if they want to keep iterating on a model — once they've taken it from you — are they able to do that? How do you think about that? Because it does sort of feel like machine learning projects are never really complete, if you know what I mean. Johannes: Yeah, no, I understand what you're saying. It depends on the customer. I don't think that there's a one rule fits all. Some customers just come back and say, ""Hey, we need retraining or we need a fresh up. Can you do that for us?"" Because they don't have an IT department. Some people want to jumpstart their IT department. They say, ""Okay, we know machine learning is the future. We don't have an IT department yet, but maybe engage with you and you help us to jumpstart the engine,"" right? And then they start continuing on that goal. It's always of course a conversation because it's also tricky for us to say, ""Hey, we're offering our expertise, we put in a lot of sweat, tears, and blood, and then you take it to the next level."" That's always sad as well. So it's just always a tricky conversation. But we're happy to help people. And I ultimately think that everyone benefits, if the community just grows. Lukas: I guess another question I wanted to ask you about is, you've written a few thought pieces on AI. I don't know if you have a favorite, but I think one interesting one was your writing on the impact of NLP models on the real world. If you could summarize for people who haven't read it? My perspective is that, in a way, the NLP field seems to be doing a whole bunch of very amazing things. And I know people argue about, ""Is this real intelligence or not?"" or like, you know, ""How much does it really matter?"" But I guess from my perspective, as a technologist and enthusiast, I kind of can't believe how good text generation has got, in some sense. And yet I think the impact to me is smaller than I would've imagined from how impressive the demos look. I don't know how you feel about that. Johannes: No, I see your point and I think that's exactly the reason why I like working where I am. Because it's right in the middle of driving the adoption of modern AI techniques. I think the reason why you'll feel the impact is not as big as it could have been or should have been is that it's really, really hard to bring technology like that to people who are not technologists like us. That's really the challenge here. You have to bridge that gap. And there is this early adopter gap and that needs to be bridged, and we are not there yet. I'm also with you. I don't really want to get into this philosophical debate. Is it intelligent? Is it conscious? Whatever it is, it's useful technology. Let's bring it to the people and have them have a better life with it, right? Let's solve some problems with that. That's maybe the philosophical side. The practical side is, if you take those big models, you'll run into a problem. 
You already need compute power. You need infrastructure. You need MLOps. You need a whole department to actually make use of those models. Not many people have that, especially those companies that it's most useful for. Take, for example, news outlets or media outlets. They are completely focused on a very different problem. They don't have technologists that just take a GPT-2 or even a GPT-3-sized model to put it into production and then figure out the use cases, right? That's just not how the economics of these companies work. Bringing it to those people, it's just really hard. That, I think, is the reason why we don't see that impact yet. It's going to come, but it's still going to take a few years. Lukas: What do you think are the next things that we're going to notice — just as consumers — from the impact of these more powerful NLP models? Johannes: I do think that a lot of stuff that will come is improvements in search. I think that the signals that we get from similarity clustering are significant, and we just need to figure out how to adopt that in the real world. If you just run GPT-3-sized models, the search is slow, so we just need to do some improvements on that. But I do think that we see a re-ranking on that front. I also think that a lot of automation will happen for automated text generation, and that's a positive thing. I don't know how much time you spend on emails. I certainly do a lot and you probably do too. And it would be nice to just automate some of that stuff away. I also talked to several customers in Germany that have this funky problem where they're in a logistics space. Logistics is a very old-school domain where you get very free-form order forms. There are armadas of people that just do nothing else than take those emails that are just free-floating, written, and turn them into structured texts by just manually copy-pasting into a structured field. Sounds easy. It's not. It's a very, very hard NLP task. Once we bring these big models into that realm, I think there will be a lot of automation for the better. I do think there's a lot of potential. I'm very excited about the future of those models. Lukas: Cool. You also wrote an article on AI and regulation I wanted to touch on. I'm curious about your perspective on regulation. I mean, obviously it's coming, but I'd be interested to know what you think about it. Like what good regulation would look like. Johannes: If I only knew, right? That's a good discussion. I think being in Europe, one of the things that I needed to learn is, ""How can you use regulation in order to build value systems of a society into your AI deployments?"" And that can be a good thing. I think the regulation needs to address the realities of AI as being an experimental technology and we need to deal with these uncertainties, but also make sure that we are not opening the door for extreme abuses, and give people and consumers the right to protest. How to exactly build that regulation? I don't know. I think that what I appreciate about the regulatory frameworks that we have in the EU is that we are more willing to iterate on regulations, which is good. We make a draft, we see how it works in practice. Some things work, some things don't work. We try to adjust. Classic example: GDPR and the cookie banners. I don't know how many cookies you have to click away. It's really annoying and people got it. And now we're trying to figure out how to build the regulation so that we don't have to do this anymore. But it takes time. 
And I think it's a process. I think as a technologist, you're actually building software for humans, right? You don't build technology for your own sake. You're building in order to make something better, to do something better. To make somebody's life better. Lukas: I guess, specifically, what's a regulation that you would like to see happen? Johannes: What I would like to see happen is to allow for ML models to have a sandbox environment where you can say, ""I can do tests on real-world scenarios where it can collect data in the real world, in a given risk frame."" And then you can get risk certifications that are going up. Where it says like, ""Okay, I did my first test that was an exposure of — I don't know — a million dollars in risk."" Just an arbitrary number, don't take them for fixed prices. A certifier says, ""Okay, that's great. Now we can go to the next iteration phase."" And then you build up this risk where you can say a certifier is willing to back you up on insurance for a given risk factor. Because only then can you actually use these experimental technologies to go out into the real world. Because right now, hands are often bound, right? Like by data privacy issues, by copyright issues, by security concerns. The regulatory uncertainties around that — especially for a startup that builds ML — is really, really high. I would like to see having protected environments, where you are allowed to test things within a certain box. I think that would be a good regulation because the consumer can slowly gather trust and can see what it can do in the real world. You start to see curiosity and you have it under control to a certain extent because if the company does something wrong, it's going to get penalized and that's bad for the company. I think that would be a good regulation I would like to see, in this form or another. Lukas: I saw you also wrote on ML and environmental impact and that's something I care about a lot and have looked at. What's your thoughts there? Do you feel like people should be finding ways to train models with less compute? How do you reconcile the fact that you're also doing model training in your day-to-day job? Johannes: It's a complicated question. On the one hand, big models and ML models are really powerful and important. On the other hand, you need to make sure that you're not burning up the planet with them, right? My stance on this is, ""Let's reduce those models as much as you can."" Fine-tuning, zero-shot learning. Once you shrink them and really invest in that money, let's make sure that this cost — this carbon footprint and the monetary stuff — amortizes. That's what we're currently seeing, right? There's a lot of interest in training these big models. Pre-train them because they fine-tune very well. I just feel like there's too many people who want to just build them from scratch and not figure out what can we do with the existing ones. I hope to see a change a little bit in that. That's my take on it. It's not just like ""Shun it"", but also ""Let's be conscious about it"". Lukas: Makes sense. We always end with two questions, and the second-to-last question that we always end with is, what's a topic in machine learning that you think is understudied? What's something that — if you had more time — you would love to look into more deeply? Johannes: If I had more time, I would probably put on my physicist hat again and try to understand a lot of the optimization problems within machine learning. 
There's a whole field that is just ripe for discovery. Which is the combination of loss landscapes and optimization problems in deep learning models and the connection to statistical physics. I think that is a really, really valuable lesson. It can actually help statistical physicists understand certain things better, but also statistical physics can probably help the ML community understand much better what's actually happening under the hood. I would love to contribute to this much more, but that's very far away from my own everyday work. Lukas: You know, I've seen papers on this topic and I always find them impenetrable, because I think I don't have the background in physics that people are assuming. Can you describe a little bit of what this says to someone like me, who maybe knows some of the math and is interested, but doesn't quite follow? Is there an interesting result that you could point to from this analogy? Johannes: Physicists typically think in terms of what we call a phase diagram. A classic phase diagram is the different states of water. You have vapor, water, and ice. Similar effects happen in all kinds of other physical materials. One of the funny things that you can see is these kinds of phase transitions, where you go from one phase to another phase, like from liquid to vapor. These kinds of transitions also happen in optimization landscapes of machine learning problems. For example, when you tune the number of parameters in the model, you go from the model not being able to optimize at all to the model just suddenly optimizing perfectly. People describe this as a spin glass to jamming transition. Very technical term, but it essentially means going from an almost quasi-frozen state to something that is just very, very, very viscous. It's very different physical properties and you can see those in machine learning models. These are the early indications that you can use — these kind of methods and tools that we developed in statistical physics — to understand the dynamics that happen in machine learning models. Ultimately I think this will help us also train these models much better at a much cheaper cost. Lukas: Cool. Well, on a much more practical note, when you think about all the models that you've trained and put into production, what's the hardest piece of that, at this moment? What is the biggest challenge from a customer wanting a model to do a particular thing to that thing deployed and working inside of their infrastructure? Johannes: I think actually getting the high-quality data is really hard. Because that's where the customer comes in and you need to actually pick them up at that point and tell them it's not just ""data in and model out"", but you need high-quality data. We did a project for semantic segmentation of very, very fine detailed mistakes on huge metal surfaces. These are tiny scratches. You have maybe like 5 or 6 pixels on like a 10000 x 1000 pixel image. And you need to find a loss function for that. These images are recorded from various different angles and labeled by different people. So on some images there's a scratch, on some images there is not. Same piece of metal, but you see the scratch and you don't see the scratch. Helping people understand how to label data, how to bring the data into a quality that the model can actually pick something up, it's really the complicated part. I think that's an understudied problem. Lukas: How did you actually get the data labeled in this case? I do have some experience with data labeling. 
Johannes: Essentially having an armada of people that use the labeling tool and teach them what to label for and get a huge feedback loop. Lukas: Did you build a custom tool for this? To find the scratches. Johannes: Yeah, we used open source software — I don't know actually which piece we used — and then just adjusted it for that use-case in order to make this quick and fast. Lukas: Awesome. Well, thank you so much. This was really fun and so many different insights. I love it. Thank you. Johannes: Yeah. Thank you. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So check it out.",8117 +Mircea Neagovici — Robotic Process Automation (RPA) and ML,https://www.youtube.com/watch?v=JCLwwycAHWE,2782,2022-04-21,"Mircea: The ML team has to take more chances. You cannot have the ML team work on a schedule and have clear times for when something is done. Something might never be done. It's also okay to fail. If someone starts a project today at UiPath and there is no result, but they do the right thing and you learn from that, that's a good project. Sometimes, you have to spend some time to learn that something doesn't work. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Today, I'm talking with Mircea Neagovici, who is the VP of AI and Research at UiPath. UiPath is a company you might not have heard of, but they're a leader in the space of RPA, which is essentially a way of automating a lot of the tasks that companies do. Mircea is an expert on real world machine learning, getting something working for tasks that actually matter to businesses. This is a very interesting, practical interview. Lukas: I thought a good place to start would be your current company, UiPath, because I think the applications there are things that a lot of our audience might not know is an issue for businesses. I thought maybe you could describe what UiPath does and then get into how machine learning fits into that. Mircea: Yeah. UiPath is in the RPA business. This means robotic process automation. What it exactly means is...basically, programs that can do repetitive tasks that humans don't want to do. Simple tasks. It's been a good business for the company since about 2015 or so. But if you made those robots smarter and if you put machine learning into them, then I think we can take this company to a new level. So I think RPA has a lot of potential, but RPA plus AI has a lot more potential. Lukas: I totally agree. But before we get into the AI, could you give a few examples of where RPA might affect someone in their day-to-day life? Mircea: I think in all areas, you see people doing repetitive tasks. Opening an email. Opening an attachment. Look at some data. Copy a number into a form, then go back to that email. Take another number. Put it in the form. These kind of very simple things. But they take time, and they can actually fail. Robots, once they get started, they are more reliable. Lukas: How do you actually set up, today, an RPA task? Is this a programmer does it or can anyone do it? Mircea: We have a concept of RPA developer. RPA developer is our target for most of our products. An RPA developer is not a software developer. It's maybe more like a basic developer from 20 years ago. 
They understand data, they understand processes, they know what has to be automated. And then they create the workflows. There is a separate question about ""What do you want to automate?"" And I think we'll probably cover that a bit later when we talk about the project. It's not always clear what to automate, especially in a big company. But once you know what to automate, the RPA developer's job is to take a process and make a workflow of it. Lukas: Got it. So, UiPath has been one of those really phenomenally successful companies that a lot of people might not have heard of. What's the killer use case for UiPath that's made it successful so far? Mircea: I think it's the broad usage. I don't know if we have a killer scenario, but we are able to save cost and we are able to have those repetitive processes be taken care of by a robot, which allows people to do other things. I think we all have experiences when we have to do something that we don't exactly like to do, like move the data from one place to another. From Excel to a form, fill the form. This is, I think, the power of RPA. Being able to do a lot of those processes. Lukas: What do you think is the current level of use of ML in UiPath today? How much ML is actually working in the product right now as opposed to in the future? Mircea: We started putting ML into RPA about four years ago. Our first project was a computer vision project. Our robots usually work because they know the Windows APIs, and they know what's on the screen, where to click, and where to type. But this is not always the case. If you run in a remote desktop, there is no Windows API available. You only see a picture. If you are in an operating system other than Windows, the same thing. For us, falling back on the picture and making our robots work with the picture, without the APIs, that was our first project. And it is a competitive advantage for us, as far as I can tell. Our competitors don't have this computer vision feature. Lukas: What exactly is it that the feature does? Does it find the button to click on based on the screen? Mircea: It finds all the controls from the screen, which are available to you if you are on Windows and we can actually use the Windows APIs. But if we have the picture, we can find everything in the picture. Then we have a design time and the run time. In the design time, we detect all the controls. People can design their workflows. And then at the run time, we have a picture that's different. Different solution; it's not the exact same picture, but it looks the same. Then we find the controls, so we know where to click or to type, given what we have done at design time. Lukas: So you started with this vision task, it's actually a really interesting task. And then what were kind of the follow-on tasks that you made work with ML at UiPath? Mircea: Then the next thing we realized is that we have now the control from the screen and we want to do OCR. There are many cases when we want to do a screen OCR. We were using at the time Google and others, but we thought nobody really optimized OCR for screens. We thought we had an opportunity to do a better OCR for our use case. We implemented an OCR around three years ago. It was not different than others. I mean, it's still the same idea. You do detection first, you find where the text is, and then do recognition. We didn't invent a new OCR, but we did train on our own data and our own use case. We built a significantly better OCR in the process. 
The same thing for document OCR later. We don't have such a big advantage in performance, but we have more flexibility. We can put it on device, we can put it in a service. We can ship it in any way we want. So OCR for screens and OCR for documents was another project. And then also during 2018, '19, we were hearing from customers that they want to do document processing. Emails, semi-structured content, unstructured. There are very many scenarios from very many customers we've seen. I think it was quite clear that the number one thing in document processing is doing information extraction from semi-structured content. Invoices, receipts, purchase orders. And then we made some models that can actually read those documents and extract what we really care about, including — like I said — receipts, invoices, purchase orders. And now, we have 15 or 18 document types. W-2s, W-9s and so on. We do some classification for those documents. We have models that do information extraction from unstructured contracts, like legal contracts, lease contracts. We've put quite a lot of effort into document understanding. Lukas: I mean, you have a lot of custom models running in production. I feel like compared to a lot of companies, you probably have a more advanced kind of operational setup than others. I'm kind of curious what the structure looks like. Mircea: We made the decision in 2018 to make a framework for hosting those models. The interesting thing is that we don't only want to do hosting. We also want to allow customers to fine-tune our models or to train our models. And then at the time, we didn't see anyone to partner with on this. There were some solutions in the cloud, but nothing on-prem. A lot of our customers have more trouble to move from on-prem to cloud for our scenarios. People now accepted that email and documents are okay in cloud, but when it comes to processes and invoices and stuff, there is more reluctance. So we put a lot of time and effort into this. It's a big engineering project. It's also an AutoML project, but a big engineering project to build a framework that can host and train the models on-prem, online. And there are multiple configurations online. We call it now AI Center and everything we do is hosted or trained in AI Center. It is a very big project. Lukas: And all these models can be fine-tuned, so you have to have kind of separate instances? Mircea: Not all, but many of them. We don't allow the computer vision model to be trained by customers. Or the OCR. Although for the OCR, we have to get the feedback from the customers and improve. But in document understanding, most of our models are retrainable. And this is why: we have a model for receipts, a model for invoices, and basically, have one big model with multiple tasks. But then, when customers start to use this out-of-the-box model, either they want to fine-tune or train it on their data, which basically means overfitting on their data — but this is a good thing for them — or they have a bit of a different schema. Or they have a totally different schema. So for all those cases, they have to fine-tune our models. Lukas: So I mean, 2018 isn't that long ago by the calendar. But I feel like in terms of ML frameworks, it's kind of ancient history. I mean, what did you end up choosing for your ML framework? What are these models training in and how do you actually deploy them? Mircea: We started with a mix of PyTorch and TensorFlow. Lukas: Oh, a mix of PyTorch and TensorFlow? Wow. Mircea: Well, we didn't mix them on purpose. 
We preferred PyTorch from the very beginning, but our computer vision models...at that time, it was a lot easier to do this in TensorFlow. Google had a research repo implementing Faster R-CNN. It was exactly what we needed. We took the model and trained with our own data. So we used both. In document understanding, we used PyTorch* (*TensorFlow) in the very beginning. And then later, it became easier with PyTorch. And actually, we also got a bit of a performance boost with PyTorch. I mean, quality performance. So at this point, everything we do is in PyTorch. We train the models and then we ship these models in AI Center. For the customer, they only see AI Center. We cannot do AutoML. We don't expose that many hyperparameters. There are very few things that we expose. The other thing I want to mention that was very tricky is we have to...for people to train our models, they can fine-tune our models in two ways. One is if they label data. This is what they do before they deploy; they label 100, 200, 500 documents and we give them our own tools to label. Or they can fine-tune on the data. Once we deploy, we have a human-in-the-loop concept and a validation station, and someone does fix our mistakes for the workflow to continue. And from those mistakes, we close the loop and we do learning. So those are the two types of data used for learning. So to come back, when we do a release, we make a branch, we train our models, we basically ship containers with code and models. And then in AI Center, they are hosted and trainable also. Lukas: I guess back in 2018, there was definitely a sense that PyTorch is kind of the framework for research where TensorFlow was kind of more for production and deployment. How did you think about that? What was the key feature that made you prefer PyTorch to TensorFlow and choose to standardize on it, even though there was some models in TensorFlow that felt like they solved your needs really well? Mircea: It was how debuggable PyTorch, that it was really...we optimized for developing faster. I mean, TensorFlow is a good framework, but it is really hard to use. It's been always hard to use even after they did TensorFlow 2.0. We didn't have such a big issue with performance. Our computer vision model runs on GPU and we have...it runs in a subsecond. So basically 0.5, 0.6 seconds, you cannot really see. Human can only notice things that take more than 0.7 or 0.8 seconds. So we did not have an issue with CV. But also, when we moved from TensorFlow to PyTorch, our PyTorch inference was a bit faster. We didn't exactly understand why, but it was definitely not slower. But in any case, we did not have a big performance thing. And then the most of our document understanding models actually run on CPU. And the request takes a second and a half, two seconds, something in this range. And for document processing by the robot, this is fine. So clearly, in some scenarios, TensorFlow was faster and PyTorch was too slow, but that didn't happen for us. We just didn't have a performance issue back then. Lukas: Do you do any kind of performance monitoring? I guess you have this human-in-the-loop system to catch issues of the models feeling uncertain, but are things like concept drift and data drift stuff that you actually kind of watch in production? Mircea: We have to do a lot more here. Drifting is a concept that we have to be concerned about. But for us, even before that, before the drifting, we have a hard time telling to a customer if their data is good enough or not for training. 
If we don't say anything and then they start a very expensive labeling process and training process and then we say, ""Your model didn't work because of the data."" Why didn't you say something before? We don't have a good visual way to tell people, ""You have to label this much. You are now 50% done, 70% done,"" or ""Label more of this, label more of that."" This is an issue for us. And then of course, the drifting. But we didn't solve the first problem. Lukas: Do you do any kind of active learning in the labeling? Do you try to pick examples that are going to help the model the most? Or how do you think about that? Mircea: That's another thing that is kind of a debt we have to do, more active learning. We do mostly supervised learning. We now know how to also do pre-training on unsupervised learning for document understanding and for CV. Active learning is something that we are now thinking about, it's not something that we shipped. But clearly, it is our way forward. Lukas: Where do you think this goes? What are applications that you're really excited about building new models to do? And what would that help UiPath do that it can't do now? Mircea: We have a very interesting project called task mining. Task mining is a product that runs on people's desktops, records what people do, and then has a nice, interesting algorithm to find what are the most common processes. If you ask a CIO what to automate, they have a hard time to say exactly what people are doing, especially in a larger org. So we built this task mining product that instead of having analysts and a lot of people talking and figuring out what has to be automated, we try to discover this thing ourselves. It is a very interesting project. It has a lot of potential for us. Basically, we start with pictures. We have a recorder that knows when something relevant happens like a click or a type. And we end up with two weeks of recording for let's say, 10, 15 users. And then we have to find some processes. It's a very, very interesting product and we have a lot less research from the big companies or from the universities. Nobody's really doing research on this. In CV, in DU, you can just read a paper, you know what's going on. Not so much here. So we have to do our own research. That's one project that we are very excited about. Lukas: So the idea here is you could look at what people do over and over, where you're confident they're going to click on something or type something? Mircea: No, not that. That one, we think about that one too. Recording people and believing what's the most likely thing they are going to do, like a language model for actions. But task mining is actually different. We look at the recording after two weeks, let's say for 15 people, and then we find the processes. Lukas: I see. Mircea: We found that the best process to automate is for example, invoice processing. Or the best process to automate is some look up that starts in some browser, does a look up, goes to Excel. These kinds of process. We just find better candidates for automation. This is just a short summary. In reality, it's a little more complicated. And we don't exactly find the process. We build an explorer for the customer to find it. But still, we believe take a process that takes, on average, 50 days down to maybe 2 days or something like this. Lukas: Have you tried this on yourself? Imagining, I wonder what it would see me doing all day long. 
Mircea: Well, us developers and engineers are not really good...if we record us, we'll see probably a lot of random stuff and coding and more watching and debugging. I cannot see myself recording something that brings any value to the product. Lukas: I'd be just scared to...I mean, I'd be interested, maybe afraid, to know how much time I spend sort of moving around meetings or sending emails. Mircea: We don't do a Big Brother kind of thing that tells you what to do and how you waste your time. We don't make people feel bad about it. We're just trying to find the real processes and not all the overhead and the distractions. Lukas: I see. It does seem interesting though, to predict where somebody is going to click or what they're going to type. I can imagine you can make interesting UI changes to help somebody if you can sort of know what they're likely to do next. Mircea: This is, for us, one of the things we want to look in the future. Can we tell what people are going to do? And assuming we can, what do we do with that information? Let's suppose you click in three edit boxes and now we know you are going to click in the fourth. What do we do? We cannot take the mouse from you and start without telling you. It's like autopilot, we cannot...so we don't know the experience. We don't know what a good experience is. But so far, we don't even know how to do that. The other thing we can do that's probably a bit easier is, we can see when you create a workflow. And then we can tell you that we see you doing a few clicks and a few types and we recognize that this is actually an action that we know. So we have those simple activities when we create workflows like ""Click"" and ""Type"" and those kind of things. But we can also have more complicated activities like ""Create user in Salesforce"". We can tell after we do 3 or 4 things, we can maybe tell that you are going to do 10 more. And all those 15 steps in the end are just 1 activity, which is ""Create user"". This is the kind of thing that I think is a bit closer to us. But yeah, the ultimate goal is to just have the computer do the human work with minimal intervention from the human. But I don't think we are that close. Lukas: Interesting. Have advances in language models... I feel like since 2018, languages models have gotten kind of much, much bigger and kind of better at predicting words. Has that affected you at all? Do you use these modern, gigantic language models in your product? Mircea: We use BERT models. We use all the big models. We don't do a GPT-3 kind of thing. Although we did some experiments with it. We don't do zero-shot learning just yet. So we don't use a language model for this kind of predicting the next words thing. But we do use the large models, trained with mass language models. We use them in unstructured documents and we use them in semi-structured documents. There is a model called LayoutLM built by Microsoft, and that's a Transformer in 2D. That one is useful for us, for the semi-structured content. Lukas: Cool. It's funny, going into this conversation, I was prepared to ask a lot of questions around the mix of traditional ML and deep learning, but you seem — very much more than I thought — using primarily deep learning models. Is that accurate or do you do any kind of traditional machine learning as well? Mircea: We try to use the best tool we know for a task. We don't say, ""If it's not neural network, it's out"". 
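For context on the LayoutLM mention above: the model is publicly available through the Hugging Face transformers library, and a rough token-classification sketch might look like the following. The words, bounding boxes, and label set are invented for illustration, and this is not UiPath's actual pipeline.

```python
# Rough sketch of the 2D Transformer idea (LayoutLM): each token carries a
# bounding box in addition to its text, so layout informs the prediction.
# Words, boxes and the label set below are invented for illustration only.
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizerFast.from_pretrained('microsoft/layoutlm-base-uncased')
model = LayoutLMForTokenClassification.from_pretrained(
    'microsoft/layoutlm-base-uncased', num_labels=3)  # e.g. O / INVOICE_NO / TOTAL

words = ['Invoice', 'No.', '12345', 'Total', '987.00']
# One box per word, in LayoutLM's normalized 0-1000 coordinate space: [x0, y0, x1, y1].
word_boxes = [[60, 40, 160, 60], [165, 40, 200, 60], [205, 40, 280, 60],
              [60, 700, 120, 720], [130, 700, 220, 720]]

encoding = tokenizer(words, is_split_into_words=True, return_tensors='pt')
# Sub-word tokens inherit the bounding box of the word they came from.
word_ids = encoding.word_ids(batch_index=0)
bbox = [[0, 0, 0, 0] if i is None else word_boxes[i] for i in word_ids]
encoding['bbox'] = torch.tensor([bbox])

with torch.no_grad():
    logits = model(**encoding).logits   # shape: (1, sequence_length, num_labels)
predicted_labels = logits.argmax(-1)
```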
We have all sort of smaller things, smaller classifiers that just use bag-of-words and trees and those kind of things. We have reasons for classification to use simpler models because they are more explainable. Or easily explainable. We usually offer choice. In computer vision, in OCR, we don't have a simple model. We have to use a neural networks. But in document understanding and especially in classification, we have other methods as well. Lukas: Interesting. Can you give me an example of how a model might give you more explainability and you would pick it? A lot of people talk about that, but it's hard to get real case studies. Mircea: We had a customer who wanted to classify documents. They want to do two things. After the model is trained, they want to see which words or which features define each class. But then they also want — at inference time — to tell which words were the main contributors to a prediction. It was a very interesting conversation we had with the customer. Before that, we were talking about explainability in more abstract terms, but this was a real use case. At predict time, they wanted to see those words who actually contribute to a prediction in the evaluation phase. But I'm pretty sure they also wanted it when the model is deployed. Not everybody will look at those words, but they want to have the option — when the model is deployed — to see the weights on those words. You can do the same thing with a BERT model, but it's more complicated. You have to get the tokens. It is definitely simpler. And also, the other thing I want to say is that we are not going to train for a customer a BERT model that takes eight hours to train or fine-tune when we can train a bag-of-words model in five seconds with similar or better performance. Lukas: Where do you think the kind of cutoff is? At what point would you switch from a bag-of-words to a more complicated model? Mircea: I think this is really hard to say. In some cases, we have to try both to know. We have some guidance maybe, but we cannot really tell. I think it depends. It depends more on the content, I think, than the size. It's a mix of number of documents. Most of our customers have very few documents and they expect us to learn from a very, very, very small number of documents. For example, they believe if they give us two forms...if they have two templates and they give us two forms for each, we should be able to do something. And that's a reasonable expectation. We have some more traditional models — no deep networks involved — that actually do just that. You give us a document to look at it and we remember it. You have a second document that we believe is the same. And then we are able to match them. We call this Forms AI, it's our newest feature. And this one doesn't use neural networks. It's just matching and searching and more traditional things. But I think what we are going to do is...when people have documents, we don't want to ask them to start with 1,000 documents or even 500. That's too much. There are cases when the documents are very much the same. And then, we should start document by document and use simple techniques internally. We should not even tell the customer what to do. But if the documents are kind of the same or very — actually, the documents are very much the same — then we can deal with them without neural networks. 
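The explainability Mircea describes falls out naturally from a linear bag-of-words model: the learned per-word weights show which words define each class, and weight times count gives each word's contribution to an individual prediction. Below is a generic sketch with a tiny made-up dataset, not UiPath's classifier.

```python
# Generic sketch of bag-of-words explainability as described above: after
# training a linear classifier, the per-class weights say which words define
# each class, and weight * count gives each word's contribution to a single
# prediction. The tiny dataset here is made up for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ['invoice total amount due payment',
        'invoice number payment due date',
        'employment contract between employer and employee',
        'lease contract tenant landlord monthly rent']
labels = ['invoice', 'invoice', 'contract', 'contract']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)
vocab = np.array(vectorizer.get_feature_names_out())

# Which words define each class after training?
weights = clf.coef_[0]  # binary case: positive weights favor clf.classes_[1]
print('classes:', clf.classes_)                      # ['contract' 'invoice']
print('top invoice words:', vocab[np.argsort(weights)[-3:]])
print('top contract words:', vocab[np.argsort(weights)[:3]])

# Which words drove this particular prediction?
new_doc = 'please find the invoice with the total amount due'
x = vectorizer.transform([new_doc]).toarray()[0]
contributions = x * weights
print('prediction:', clf.predict(vectorizer.transform([new_doc]))[0])
print('strongest contributions:', list(vocab[np.argsort(np.abs(contributions))[-3:]]))
```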
But if they keep giving us documents and we keep making mistakes after 5, 8, 10, 15 documents, there is a cutoff point where we say, ""This template is just too complex for our simple tool."" Our simple tool is more like a vehicle to get you started. Where we end depends on the content. Lukas: Interesting. That kind of reminds me, do you do any kind of AutoML? Is hyper-parameter search something that you do all the time or in certain cases? How do you think about that? Mircea: We implicitly do AutoML. You cannot...at this point in 2022, you cannot tell a customer, ""We give you a classification, but you have to change the learning rate, you have to change the batch size."" You cannot do that. You have to find a way to do auto... Whether you like the term or not, you still do some sort of AutoML internally. There are models that are kind of easier to generalize and you don't have to change as many hyper-parameters, and some of them are harder. But the ideas that you had before, like, if you remember, the Azure ML product where you gave people 50 choices...I think we are past that, and people expect you to just figure out what to do. But if you internally want to train 1 model or 50 and choose the best one, I think that's up to us. Lukas: But it's interesting because it seems like from a lot of the examples you gave, sometimes your goal is not to make the most accurate model, but the model that will kind of fine-tune the best on the customer's data. Does that mean that you're optimizing something special? How do you know if the model's good in that kind of situation? Mircea: We have an evaluation framework. But you're right, we don't necessarily... Let me give you an example. If you train a model for too long, you might end up with a slightly better model, but the confidence scores are worse because of the way that overfitting works and the way that the numbers get too close to 1. Basically, you are very confident in wrong predictions. You get most predictions right. But on the ones you get wrong, you are very confident. And this is a thing that we have to figure out. What is the trade-off between overall model performance and other things? Fine-tuning is an aspect. One thing that our customers really care a lot about is our confidence scores. Everybody will take a model from us that's 3 points worse in terms of quality if the confidence scores are perfect. Because the confidence scores will tell them when to get a human involved. So yeah, it's not only about getting the absolute best model, like a paper kind of goal for us. The goal is to make the product work, not necessarily just have the highest score for the model. Lukas: I really appreciate that perspective. I guess, switching gears a little bit, but something I really wanted to cover is looking at your background, it looks like you've gone from more traditional software engineering to running a machine learning organization. And I know from talking with people that enjoy these interviews, that's the perspective of a lot of people watching this. So I'm kind of curious if that's actually true, if you kind of learned machine learning mid-career? And either way, if you have any advice for someone that's trying to do the same type of thing? Mircea: I was at Microsoft for a very long time doing software engineering. And then after 12, 13, 14 years, something like this, I wanted to do something new and I didn't exactly know what to do. 
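Mircea's point about confidence scores deciding when a human gets involved can also be sketched in a few lines. The 0.9 threshold and the function name are illustrative assumptions; any classifier that exposes predict_proba, such as the one in the previous sketch, would plug in the same way.

```python
# Sketch: use confidence scores to decide when to involve a human reviewer.
# The 0.9 threshold is illustrative, not a setting from the conversation.
def route_predictions(clf, X, threshold=0.9):
    """Return (label, needs_human) pairs based on the top class probability."""
    probs = clf.predict_proba(X)               # shape: (n_samples, n_classes)
    labels = clf.classes_[probs.argmax(axis=1)]
    confidences = probs.max(axis=1)
    return [(label, conf < threshold) for label, conf in zip(labels, confidences)]

# Usage with the classifier and vectorizer from the earlier sketch:
# for label, needs_human in route_predictions(clf, vec.transform(["invoice amount due"])):
#     print(label, "-> human review" if needs_human else "-> processed automatically")
```

This is why a slightly less accurate model with well-calibrated probabilities can be more useful in the product than a higher-scoring but overconfident one.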
I was very lucky to talk to a few people at Microsoft who actually made me see there is this machine learning opportunity. Then I started to learn and I was really fascinated to kind of go back into learning mode. Now that I look back at the last 20 years, I had a gap. I kind of thought you joined Microsoft to learn on the job. And this is true to some extent, but I don't think it's enough. So then, I went back into more learning mode and did some math and some statistics that I had not done for the previous I don't know how many years. And then after about 18 months or so of doing this for maybe 4, 5 hours a day, nights, and weekends and so on, then I thought I was ready to change jobs. I moved from my previous engineering role to a Microsoft Research team. That was a very good move for me. I was just learning these things. So they hired me to help them do more of the engineering, but also, they understood that I wanted to do more machine learning. But then, I thought actually, I also want to go back to school. And I started a master's program in computer science at UW. So yeah, so basically, what happened is that I spent about 4 years or so learning online. Coursera, at the beginning. And then this master's. And then I was able to transition from a software engineer role to this ML thing. Lukas: Do you have any advice for your younger self or someone that wants to make this transition? Mircea: I think they have to be motivated. This is a long journey. I think if you believe you can do this in two months, that's not setting the right expectations. You have to be prepared for a longer transition. And I think you have to go back and do some math. It depends after how many years you want to transition. It's a lot easier to transition early. And also, for younger people who now go to those good universities, they have good knowledge about math that's fresh in their mind, and they have good ML courses if they are interested. So I think what I can say is more for people who've actually spent 10, 15 years in software engineering. Just prepare for a longer journey and try to learn the fundamentals. If you rush into it and you...it is not enough to be able to say ""model.fit"" and put some parameters in it. That's not going to do it. I strongly recommend those master's programs. I think they are good programs, and they kind of force you to put in more time, and you have to do a lot of projects and homework. The other thing I thought was a good resource is to do Kaggle competitions. I was in three of them and it was just a great experience, but very...the second part of it was very intense. But overall, Kaggle is a great resource, I think. Lukas: I love that answer. Do you think that your background in software engineering makes you approach machine learning differently in any way? Mircea: I don't know what to say about that one. I think it's good to have some software engineering experience. A few things happen. If you don't do software engineering for a few years — I haven't done software engineering for 5 years now — you are not current anymore. Things happen that you don't exactly understand. I hear people talking and, more and more, it happens to me that I don't understand the details of what they're talking about. I think it's very hard to do ML and do engineering — basically both — at a good level. This is why, at UiPath, we have a separation between the more science/ML team and the engineering team. But I think it's good to have the background. 
It's good to understand memory and processors and threads. Lukas: Are there differences in the way that you think teams should approach an ML problem versus an engineering problem? Is even the cadence of shipping different? Mircea: The ML team has to take more chances. You cannot have the ML team work on a schedule and have clear times for when something is done. Something might never be done. It is also okay to fail. If someone starts a project today at UiPath and there is no result, but they do the right thing and you learn from that, that's a good project. Sometimes, you have to spend some time to learn that something doesn't work. It's harder to do this in engineering. In engineering, you have more strict schedules, more products, and all those hurdles and sprints and so on. So yeah, I think you have to organize somewhat differently. We are a more hacker kind of org than engineering. We're also more flexible; it's easier to move people from one project to another. For us now, our CV model, our DU model, our task mining model, they have a lot of things in common. Lukas: It's funny. We were talking to Jeremy Howard, the fastai founder, and he was saying that he thinks that engineering software is kind of more fun because you make incremental progress that you can really see. And I was kind of reflecting on that. I think my background is more in ML, but actually adding features to the Weights & Biases product is definitely more satisfying for me than training ML models. I feel like ML models, mostly they don't work and the debugging cycles are way longer and harder. Is that consistent with your experience? Or there must be something about ML that you love. Mircea: Yes, but I mean, we do new features, although we don't...I mean, it depends how we define engineering. I think the way I look at it is this. You have people who do research science and they write papers, they create new knowledge. We don't do much of that. We have one researcher and we want to hire a second one. But for the most part, we don't do research. We do applied science though. Most of our team is an applied science team. So we do build new features and it is...our work is, I think, maybe 10% in training the models. And the rest is to just make something happen. Make them work somehow, put them together. But then, it's the engineering team who actually puts those things in production and creates the containers and deploys them in our data centers and takes care of scale and availability and all the networking and all the other things. So that's why I'm saying it depends where we draw the line. We, in this team, don't just train models and then tell others, ""Okay, take the models."" We do the post-processing. In most cases, there is more post-processing than the model itself. We do pre-processing, we do data manipulation. So we build features, not just a model that doesn't do anything and is just nice and shiny. But I know what you're saying. We also like to build features. Lukas: Are there different ways that your team collaborates together? Is it a different kind of collaboration than an engineering team? Even though I know you're applied science, it's still kind of a different thing than software engineering, I think. So are there kind of different ways to do code reviews and things like that on your team? Mircea: We have less process than the engineering teams I'm aware of. I don't know in detail how an engineering team functions now, but I think many things are in common. 
We want people to write good code, we want people to write the simplest code possible and not complicate things to the point that nobody understands. So there are some things that are similar, but there are also some things that are different. When we merge a PR, we don't ask people questions like, ""Have you seen this one in production? What is the impact? What is the latency difference? Where is the telemetry?"" Although we want to have telemetry. I think the coding part is quite similar to engineering, but the way we change our mind and the way we choose the project and what to do and what not to do and the flexibility, I think, is the main difference. Lukas: Interesting. Mircea: And then the testing is also something that we are very...I mean, testing is very important. You cannot ship a good product if you don't have unit tests, automated tests, and so on. So some of that end-to-end testing is owned by engineering, but we also do significant testing. So that's another thing that's similar between us and engineering. I think it's really the flexibility that's different. If we now believe a project is really important, we can more easily move people around. We now have a semantic automation project that tries to make the robots understand better what's going on, not just click and type. And this is a mix of CV and document understanding. And we can apply the same knowledge or we can use the same graphs. Yeah. There are many, many, many things that are similar between our team and engineering. Lukas: Interesting. Well, we always end with two questions and I want to make sure that we give you some time to answer them. So, what's an underrated aspect of machine learning or deep learning that you think people should pay more attention to? Or maybe what's something that if you could go back to school or had more time to look into, you'd spend some time engaging with? Mircea: I think there are two ways to answer the question. I think people spend a lot of time on models and I think people should spend more time on data. This has been changing in the last year or so. You see this more and more. If you want to improve the product, you look at the data more than you look at the models. So that's something that people are talking about. To me, I would also like to see more effort put into more business kinds of data. All those nice models are trained on Wikipedia and...but customers have very small datasets with all those semi-structured things; there are no paragraphs, no sentences. It's quite hard to take a good BERT model — or all those NLP models — and apply them on the documents that you see in the enterprise. There is a lot less context. The graphs are less connected and so on. So this is about datasets and about customer data and business data. Lukas: I totally agree with actually all the points that you just made, but I guess I want to ask about the data thing. People have been noticing that it's a better use of time to spend more time with data for 20 years at least — as long as I've been kind of watching it — and yet it seems so hard to get teams to look at the data as much as they should, by the teams' own admission. I mean, what do you think is going on there? Why is it so hard to orient more towards the data than the models? Mircea: It's not clear who is motivated by that job. I mean, people have been talking about it in theory. But not really doing anything about it. And if you look at what...people really love to train models. Even before the neural networks, people loved to train trees and so on. 
But not many people are passionate about the data in itself. All our good people do the data manipulation and the clean up of the data just to build better models. There are now companies who help with the data, including what you guys do. But what is the profile of a person for us to hire to actually really focus on the data is unclear. Do you want to have software engineers, you have data engineers? It's unclear. I think this job really belongs to the applied scientists, but they rather do something else with their time. So I think this is why everybody says we should do more progress, but actually, nobody really does. Lukas: Right. Okay. That makes sense. My final question for you — and this is an interesting one, because you've put probably more models into production than most people in the world, most people on this show — what's the hardest part about getting a model from conception to running live in production? Mircea: When we build something, we start to do the ML parts. We see if a project has legs. But to ship it, you need a big machinery in place. You need testing and you need engineering and you need product and you need alignment and people to sell to customers. The real thing, not to oversell or undersell. I think building this whole machinery is, to me, the biggest part. In the end, when you are done, you realize that the ML part that you love so much is just a small thing. Whether it's 10% or 15%, I'm not sure, but there is a lot more work on top of that. People in ML should give more credit to engineering and product managers and pre-sales because without those people, there is no ML in production. And then having everybody aligned and kind of see the same...go in the same direction, this is tricky. The other thing that's tricky is to have more experimentation in the product. We struggle with convincing our product managers and our engineering to do more experiments. Put more stuff into their code so we can experiment and maybe ship a better product. It is very hard to take time off the schedule for something that has the potential to give you nothing. On the other hand, if you don't do this and you are only...if you only exploit and don't explore, it is not good. So this is another tricky thing. How do you convince the whole org to have the right mix between exploring and exploiting? Lukas: Awesome. Well, that's a great answer. Thank you very much. Mircea: Thank you. Lukas: This was super fun. I appreciate it. Mircea: Very nice talking to you, Lukas. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So check it out.",7432 +Jensen Huang — NVIDIA's CEO on the Next Generation of AI and MLOps,https://www.youtube.com/watch?v=kcI3OwQsBJQ,2935,2022-03-03,"Jensen: For the very first time in human history, we are producing, manufacturing intelligence, like production. Raw material comes in. A lot of genius goes into that box. And what comes out is intelligence that's refined. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Today on Gradient Dissent, I interviewed a guest that I've been looking forward to interviewing for quite a long time. This is Jensen Huang, who is the CEO and founder of NVIDIA. 
If you've trained a machine learning model, you've probably trained it on NVIDIA hardware. We get into machine learning and we talk about his views on what the future holds. This is a super fun interview, and I really hope you enjoy it. Lukas: All right. Well, thanks so much for doing this. We collected questions from our community; they had a ton, so there's more questions that I'm sure we can get through. So I'm going to get into my questions first. Jensen: Okay. Lukas: I wanted to start with the number one question I wanted to ask you which I've always wondered about. Which is, I think almost everyone training machine learning models these days uses NVIDIA, and I was really curious about how conscious of a strategy that was. Like when you started to think about it and how you made that happen. Jensen: It started when almost simultaneously, three different research teams reached out to us, asking us to help them accelerate their neural network models. Turned out the reason for that was because they were all trying to submit for ImageNet, the big competition. And so deep learning came into our consciousness kind of around that time. Lukas: What year is this? Jensen: This is...when was Alex's ImageNet? Lukas: It must have been like 2011, maybe. Jensen: Yeah, I was going to say 2012 or 2013. But anyhow, it's something like that. Anyways, AlexNet, it was that year. It was kind of around our consciousness around that time. The thing that was really exciting was...we all know that computer vision was hard to do. And for Alex to have created a neural network, trained it on a whole bunch of data, and broken a record of computer vision experts — of which many of them were at NVIDIA trying to do the same using human engineered features — that giant breakthrough caught a lot of our attention. Computer vision, as you know, is one of the foundations of artificial intelligence and all of a sudden a giant leap happened. And when discontinuity happens on something that important, it really caught our attention. I think the difference between what happened around the rest of the world versus us is we took a step back and we said, ""What is the implication of this?"" Not just for computer vision, but ultimately how software is done altogether. Recognizing that for the very first time software is not going to be written — features weren't going to be engineered or created by humans, but somehow automatically extracted out of data, refined out of data to recognize patterns, relationships, and somehow learn the representation of some predictive model — that observation early on caused us to ask the question, ""How does this affect the future of software?"" How does this affect the future of computer science? How does this affect the future of computing? How would you change the way a computer...if the way that you write software is different, then how does it change the way you would design computers? And if the software that's written is written by a computer versus a human, how does that affect the type of computers you would design? We had the good sense of thinking about it — from first principles — the implications for the entire field of computer science and the entire field of industry. Which ultimately led to asking the question, ""What about the implications to all the different industries?"" I think that the good fortune was we were interested in computer vision. 
We saw the gigantic breakthrough from Alex and Geoff Hinton and the folks at Toronto, and we simultaneously were working on it with several other labs at the same time. So I think it was partly good fortune, partly having the sense to realize the profound implications to computer science, and then asking ourselves what the implication is for everything. Lukas: I think one of the things that you've done amazingly well is just stayed dominant in this space. You might've had a head start, but of course, lots of other people have noticed that this is a really valuable space. And I've been hearing since maybe 2014 companies saying, ""Hey, you know, we're going to make the next deep learning training...GPU or TPU,"" or something like that. But you've actually really maintained this ubiquity in the market, and I wonder what you attribute that to. Is it more the architecture of the chip or is it more the software, like CUDA and cuDNN? Or is it something else that kind of keeps you ahead of the competition? Jensen: Well, partly because the company was formed properly for this opportunity. We were always in the field of accelerated computing. If you could go all the way back to computer graphics, all the way since the beginning of our company, this new way of doing application acceleration — domain-specific application acceleration of which computer graphics is one; scientific computing and physics simulations and others is another; image processing, for example, is another; you could argue that deep learning is yet another — these different domains of applications, the company was started with that mission in mind. Now, in order to do accelerated computing — domain-specific accelerated computing — you really have to be a full-stack company. You have to understand what is the application, the nature of the application you're trying to accelerate. You have to redesign the algorithm, because the way that you would write an algorithm, development algorithm, for sequential processing is radically different than parallel processing. Algorithm engineers, our company has a richness of algorithm engineers. You have to think about the system software and the systems differently because the workloads change the bottlenecks. And so you have to think about system software differently, you have to think about systems differently, you have architecture of the chips differently. Our company is fortunate that we are a full-stack company that goes all the way to the research of algorithms. That's what it really takes to be an accelerated computing company. But I think the advantage that we have is that we've been a full-stack computing company for a very long time. We have taken that skillset from computer graphics to imaging, to scientific computing. And then when deep learning came along, it was a problem that our company was very adept at solving. Lukas: That's a good segue into a question a lot of people had that I have also, which is, ""Is there a lot of tension between the needs of gamers and crypto miners and scientists and people in deep learning?"" And how do you trade those off into a single chip? Like how do you prioritize the different needs? Or maybe there's no tension because everyone has to do the same type of workload. I'm curious how you think about that. Jensen: Yeah, there's absolutely a tension. For example, scientific computing...because it has a large body of historical code, and they could be in FP64, whereas for consumer applications, FP32 is just fine. 
Whereas for deep learning, there's quite a large number of different formats that could be used. The nature of the processing could be a little different. Sometimes it's very dense computation. Sometimes it's a little bit more sparse computation. Ray-tracing, for example, is very sparse. Rasterization, on the other hand, is rather dense. Image processing is rather dense. You have different computation natures. You have different precision that you have to support. Each one of the industries has a very large number of applications in use that you want to support and be able to accelerate. And so each one of these industries is a little different. We try to build...we build a GPU that is universal, in the sense that all of these applications can run on any of our GPUs. And that gives developers a very large install base to target. They know that when they develop on our architecture, it'll run everywhere. The only question is, in each one of the processors that they run, is it better for scientific computing or is it better for machine learning or is it better for imaging or computer graphics? We shape the size of those capabilities — those functionalities if you will — for the different applications, different markets that we serve. In the case of GeForce, there is no FP64 richness, although it runs. It runs rather slowly. In the case of deep learning chips, it'll run computer graphics, but it will run less well than GeForce and so on, so forth. We adjust the size of the functionality to the market that we serve. In combination with the software stack that goes on top of it, we should be able to bring the best products for the use case. Otherwise, everything is universal and everything just kind of works. Computer graphics, scientific computing, training, inference. We really believe that developers ought to have the largest possible install base and not worry about whether the software's going to run or not. It should always run. The question is whether it runs to its fullest capability. Lukas: I see. Well, another question along those lines is, do you think radical changes are coming? In particular, do you think quantum computing is something really relevant to you? Like something that will be a practical reality in the next...in our lifetimes or the next five to ten years? Jensen: It will definitely be in our lifetime because, Lukas, you and I are still pretty young. So we'll definitely see it. However, it's not likely, in the next five years, to be generally useful. On the other hand, the important thing is — and this is really the marvelous thing about machine learning and deep learning — in many of the applications, whether it's drug discovery or large combinatorial planning and optimization problems (pathfinding, traveling salesperson problems) — these types of problems people have historically thought would need quantum computing...because of machine learning, because of AI, we've made giant leaps. It's not, you know, Moore's Law-type leaps. If you look at the body of work of your customers, and our customers, and the scientists that work in both of our companies, in the last 10 years, where Moore's Law — if it was moving at full rate — would have increased performance by probably 100x, many applications — because of machine learning or deep learning — have improved by 1,000,000x. Lukas: Totally. Jensen: We've improved performance by a million times. 
And over the next 10 years, I fully expect that — because of a couple of different innovations between accelerated computing and the further advances that we're expecting in deep learning, and this new field called physics-informed neural networks, we're doing some really fantastic work there — in many areas of scientific discovery, we're going to see probably another 1,000,000x. A 1,000,000x advance is something that's kind of hard to wrap your head around. But we're going to see that in so many different fields, whether it's in healthcare or climate science or other fields of physics that are really important to us. Lukas: Are you someone that believes that we'll see AGI in our lifetime? Do you think the singularity is coming? Jensen: I don't know about that. However, if we reframe the problem, if we reframe the question just slightly and say, ""Will AI be able to do things that are much better than humans can?"" You and I both know that, in fact, if you reframe the question that way, AI in many, many fields is already superhuman. And I think that the number of superhuman skills that AI will learn over the course of the next decade...it is quite extraordinary. I doubt that there will be many repetitive manipulation tasks that robotics won't do better than humans. Which is one of the reasons why there's so much work in surgical robotics. Their hands will never shake. They'll be able to make the most minute and the most precise of incisions, and their perception ability is going to be incredible. So I think that in the coming years, we're gonna see superhuman AIs. They won't be like us, but in many domains of activity, they'd be quite incredible. Lukas: But I imagine where you sit, you're watching AI help with chip manufacturing and design better chips. And you're probably seeing that have compounding returns, which I think is sort of the thesis behind the singularity, right? It's sort of, AI starts to create AI. You just see this exponential... Jensen: That's exactly right. Look, we're not going to be able to build next-generation chips without AI. And that's kind of a remarkable statement. That all of the chip design process, the architectural process...today we have 5 of the world's top 500 supercomputers in our company, and we are producing software that gets shipped with all of our AI chips. Without AI, we can't produce software that runs the AI. And in the future, without AI, we wouldn't be able to design the chips that we use to run AI. So that's right, the circular, positive feedback system is about to go into turbocharge. I have every confidence that in the next 10 years, we're going to see even greater advances. Not necessarily at the transistor level, but absolutely at the computation level. Lukas: Do you have any concerns about...as compute gets more and more important to advances in science, that there's impact on the climate or even impacts on access of who's able to make scientific discoveries or who's able to kind of make the next really exciting company if they need a supercomputer to do that? Jensen: First of all, one of our greatest contributions to the industry is we democratized scientific computing. Because of NVIDIA GPUs, the breakthrough for AlexNet wasn't a supercomputer in the cloud, it was a GeForce card. Simultaneously, researchers around the world were buying GeForce GPUs. 
And because architecturally they're all the same as the supercomputers we're building, they were able to use that to discover the next...the breakthrough that we're all enjoying today. The same thing is happening in so many different fields. And so I'm really proud of the fact that we've democratized high-performance computing. We put it in the hands of any researcher. They don't have to go get gigantic funds to be able to do their research. One of the scientists that was in quantum chemistry said to me one day that he had learned from his son, who was working at one of the computer companies here in Silicon Valley, that he should go and buy our gaming cards, and download the CUDA SDK, and port the quantum chemistry software that he was running on an IBM supercomputer onto our gaming GPU. He was so amazed how fast it was. He had to wait for the rest of the week for the supercomputer to finish, so that he could compare the results, that it was the same. And then he went and bought as many GPUs as he could from the retail stores and made himself a bespoke...a homemade supercomputer. Lukas: That's awesome. Jensen: He said to me, ""You know, Jensen, because of your work, I'm able to do my life's work in my lifetime."" In a lot of ways, we built him a time machine and he was able to see the future in a way that he otherwise couldn't. So I think the first contribution is we democratized scientific computing. The second thing that we did...because of artificial intelligence and this idea of pre-trained models and transfer learning, we now have the ability to essentially have large companies pre-train intelligence. It's almost like creating a whole bunch of new college grads — super well-educated college grads — that are now going off into the world, that people can then adapt to their particular skills. In a lot of ways, Lukas — the work that you do, the work that I do — what we've done is we've actually lowered the bar. We've democratized intelligence. We democratized computer science so that almost anybody can download a pre-trained model and perform superhuman capabilities for their application domain by retraining it, by adapting it, by applying a transfer of learning capability to it. I think artificial intelligence is the most powerful force that has come along. And one of its benefits is going to be to democratize computer science. Now, one of the things that you mentioned earlier about energy...I think that one of the greatest projects we're working on is this thing called Earth-2, which is a digital tool which...we're going to try to build a digital twin to mimic the climate of the earth. It's a multi-physics problem, thermal dynamics and fluid dynamics and chemistry problem, and a biology problem, and the human driver problem, and economic problem. All of it contributes in this geometry-aware...because, you know, terrain matters and multi-physics...and we finally might have the necessary algorithms to be able to take a swing at this and build a full-scale digital twin of the earth. And hopefully inspire us by giving us a model to test our mitigation strategies, and our adaptation strategies, and simulate whether the technologies we're going to use to absorb carbon or carbon emissions will have the necessary impact a decade, two decades, four decades from now. If not for deep learning and the work that we're doing, that wouldn't even be possible. I wouldn't even imagine doing it. Lukas: Cool. 
One of the things I wanted to make sure I asked you, on a personal level, is I've really admired how you've run the same company for a really long time. It doesn't look like an easy company to run. I mean, there's a lot going on, and a lot of physical things, and it clearly hasn't just been this rocket-ship SaaS startup. And yet you seem very technically current. It really does seem like you stay on top of trends and keep a level of technical depth. I was wondering how you do that, how you stay educated about what's going on in scientific computing and machine learning and other topics. Jensen: Well, I'm a little sleepy right now because I was up at three o'clock reading and...there's just no other way. I think you just have to keep on learning. Lukas: You're just interested in the topic and you just- Jensen: -I don't know. I don't know that there's...I wish, Lukas, there was wisdom to pass. I paused for a second. Was there a secret? Nope. I think partly, of course, is really, ""Where's the energy and the curiosity juice coming from?"" Being surrounded by really bright people, you learn from them, which allows you to combine a lot of your own understanding. And when you decode a puzzle or you learn something new, it really gets you fired up. I think one of the most important missions — and the purpose — of a CEO is to create the conditions where amazing people could do their life's work. I really take that very seriously. I try very hard to create a condition where amazing people could come and be surrounded by other colleagues that are incredible. That, I think, contributes a lot to it. And then the rest of it...as a CEO of a tech company, you really need to enjoy learning about what's happening in your company — which has plenty to learn — and what's happening around the industry, and see if you could imagine a future that's better for everybody. Lukas: I think a big part of my learning process that's hard to do running a company is tinkering and stuff. I'm wondering if that's...I think you're originally an engineer. Do you find time to ever write a little code or put something together? Jensen: Not for a long time. But we get to tinker through other people. This is the wonderful thing. NVIDIA is now 24,000 people. If I could tinker a little something with everybody, the amount of tinkering that's going around the company is incredible. There's a phrase that I say. I reach out to my friends — and I really see them that way — I reach out to my friends all over the company, and we brainstorm a little something and they go off and try something and somebody else they're brainstorming with, they try something. That's, I guess, tinkering at scale. Lukas: That's super cool. I love it. Another question a lot of people ask...I'm curious, people originally think of NVIDIA as for games. Are you a gamer at all? Do you play video games? Jensen: I haven't played much games. I see almost every game that goes by, because we get the benefit of some collaboration that we do with just about every game company in the world. So when they're in the labs, people will tell me and I'll run down, and go check it out, and play with it a bit. But the last time probably...one of my favorite games was when Battlefield first came out. My kids were teenagers at home and they were both coming into their gaming age. And the three of us playing online Battlefield was just incredibly fun. That was probably some of the funnest memories I've ever had. Lukas: That's awesome. I'm curious. 
A lot of people have been talking about, you know, supply chain issues and a global chip shortage. Is that something that's on your mind a lot? Is that a problem for your company? Jensen: Sure. Yeah, sure. We build the largest chips in the world and the most complex computers in the world. DGX is a few hundred pounds. It's so heavy. It's the heaviest computer that's being built today. It is so heavy that it takes a robot to build it, like a car. Most computers don't have to be built that way, but DGX is a miracle of computing. And we built it completely from a blank sheet of paper, wrote all the software and all the tools that went on top of it. There's a lot of components inside, especially...something that's a few thousand watts is quite a miracle. There are a lot of parts, and all it takes is one diode or one voltage regulator to keep it from shipping. So our NVIDIA supply chain is quite an amazing machine. We know that artificial intelligence is such an amazing thing because we are producing intelligence. For the very first time in human history, we are producing, manufacturing intelligence, like production. Raw material comes in. A lot of genius goes into that box. And what comes out is intelligence that's refined. And so, large companies are depending on us. AI is intelligence being manufactured at large scales. So the teams are working really, really hard to keep up with demand. Lukas: You've been running NVIDIA for quite a long time. I was curious how you feel you've changed as a leader over the decades of running the company. Jensen: You know, you're almost asking the wrong person. You could ask almost anybody else around me. Lukas: Fair enough. How has your experience changed? Jensen: That's an easier question for me. When I was 30 years old, I didn't know anything about being CEO. I did a lot of learning on the job. There were many management techniques that were just really dumb, and I don't use them anymore. Lukas: Like what? Jensen: Well, alright. I'll give you a couple. Lukas: Awesome. Thank you. Jensen: The list of dumb things that I've done over the years is quite large. I could write a book. But for example, I really wanted, in the early days, for the chips to tape out. I thought what we needed to do was motivate the engineers to tape out the chip. So we had this thing called a ""tapeout bonus"". And that's just a supremely dumb idea. The reason for that is because if the engineers could have taped out the chip, they would have. Putting that bonus there is unnecessary. On the other hand, by definition, they're gonna be late. And when they're late, it becomes a de-motivator, because they no longer can earn a bonus. The tapeout bonus — for all the CEOs that are doing it — it's a de-motivator, not a motivator. It's a little silly. I think the answer is, a chip gets taped out when a chip is ready to be taped out. We can create the conditions by which great work can be done. We can be good listeners and eliminate obstacles for the team. We could be part of the solution by highlighting issues, recruiting. All kinds of things that we can do to help them reason about priorities, help them reduce the scope of their work, and try to seek the minimum viable product instead of building such giant things. There are a lot of different skills that we could've instilled into the organization, but the one thing that it doesn't really need is a tapeout bonus, an achievement bonus. Because everybody's trying to do their best. That's one example. Lukas: That's a great one. What else? 
If you've got others, I'd love to hear them. Jensen: Okay. Here's another one. Well, I want to be diplomatic as well, because there's so many CEOs that are out there. They could be using some of these techniques, and I hate to be critical of them. So this is not a criticism, this is just my style. I tend not to do one-on-ones. If there's anything that I need to say, I tend to like to say it to the team and the group that is working on it, so that we're all hearing the same things. I'm hearing the same things, everybody else is hearing the same things, instead of being translated. Lukas: Interesting. That's a really unusual perspective. I think a lot of people think you absolutely must do one-on-ones. So you do that across the company? Do you think like your reports- Jensen: -I don't do it. I don't do it, but I have many leaders who do. I don't criticize them for doing it, I just don't do it. The reason that it's probably more important for CEOs not to, is because...you can't eliminate it completely, but you want to reduce the amount of, ""Jensen told me,"" or ""Jensen told me that,"" as a way to somehow steer a conversation that otherwise should have been done on merits. And instead of my will somehow being translated and repeated and interpreted through a chain. If I had a particular objection towards something, I would say it to more than one person. If I believe that in working with the rest of the company a particular strategy or direction ought to be taken, I would tell everybody at the same time. I've worked towards this approach because I feel it's much more transparent. It puts knowledge and the access to information in the hands of as many people as possible. And of course it attracts more criticism to myself. For example, I might say something to ten people and it is the dumbest thing in the world to say. It was a terrible idea, you know, couldn't be a worse possible strategy. But instead of saying it to one person, I don't get the benefit of refining my ideas and then broadcasting it and always being a genius. Therefore, in this technique, you need to be a little bit more vulnerable, and you need to be able to deal with the fact that every so often you said something that wasn't perfect. Nobody holds me to a standard that needs to be perfect, anyhow. And so I, after nearly 30 years, I've kind of worked my way past that. If I say something dumb, don't hold me to it. Give me a chance to change my mind. Lukas: Is it a different experience, running a company where it feels like it's struggling versus now, where the stock seems really high and probably everyone's feeling really good about the prospects? Do you have to do different things in those different situations? Jensen: I'm never different. I don't think it's possible to find a correlation between my behavior and the stock price. And I would say for 29 years, my behavior and the way that I approach problems, the way I approach people, the way I approach a company or work...exactly the same. There's no correlation whatsoever. You just got to give me a second, I'll find all kinds of issues to talk about. I've got nothing but problems that...you know, CEOs are surrounded by problems, not good news I happen to enjoy that. I enjoy solving problems. So I completely separate the financial success of the company from the importance of the work and doing impactful work. I've historically always done that, whether the company is doing well or badly. 
When we were doing badly, particularly during the time when we bet the farm on accelerated computing — we wanted every single chip to have the same architecture that I mentioned earlier — the pressure on our financial performance was immense. But I was equally as enthusiastic then, and believed as much in the future, as I do today. Lukas: That's incredible. You don't feel the outside pressure at all, or are you able to separate yourself from it? Jensen: No, as a public company you're going to feel a lot of outside pressure. Some investors are really artful in expressing their displeasure and criticism, and some investors are understandably less patient. But it's our job to express the reason why we're doing what we're doing. CEOs have to be...we have to be reasoned. We have to have a purpose by which we're doing something. If we're clear in expressing why we're doing something, and our vision for it, and we genuinely believe it — we genuinely believe it — my experience has been that people are willing to give it a shot. When we first started our company, consumer 3D graphics didn't exist. Even APIs for it didn't exist. We had to go evangelize that. And it took longer than people thought. When we moved into accelerated computing, for about 15 years it didn't exist. It took longer than I thought. I thought it was going to take 2 years, but it took 15. AI was the same way. I spoke endlessly about the importance of machine learning and deep learning for the first 5, 6, 7 years. I think people just didn't get it. Which is fine. That's part of building a new market and building a new approach. You have to recognize that it takes time for people to come along. I think the industry has been really patient with us, and our employees have been very patient with me. I've really appreciated it. Lukas: What's the thing that really motivates you right now? What's the purpose that you feel like you're serving at this moment? Jensen: Our mission...the company doesn't have a mission statement, but nobody's confused at our company in what the mission is. It really is as simple as, ""Do impactful work,"" that takes a very long time to succeed — because it has to be hard for it to be meaningful for our people — and that we are the best in the world at solving. We seek those problems. I seek those problems. There are two areas that I'm super excited about right now. One area is recognizing that we — in several domains — have invented the intelligence capability, the technology of intelligence. Whether it's in perception or speech AI or language understanding, we're now able to have some technologies that can do these things. However, ultimately what's valuable is not intelligence. Ultimately what's valuable is skills. We hire new college grads with lots of intelligence, but very few skills. And then we give them skills by adapting them to domains. In a lot of ways, that's essentially...what is missing right now is to take the intelligence technology and translate it into valuable skills. Valuable skills, whether it's driving, autonomous vehicles. Valuable skills like customer service, and call centers, and such. Valuable skills like automated checkout. It could be automated skills like radiology. Put a radiologist right into the instrument. There are all kinds of really valuable skills that we can now create. 
That's a big part of where our energy is right now, how to take this enabling technology and translate them into skills that customers in the industry, developers, could then adapt it for all kinds of different domains. That's one, the large-scale application of artificial intelligence. Second, is the next era of AI. We've done a really good job with soft AI that's in the cloud. Recommending music, recommending movies and the next item in the cart, and so on and so forth. It's really incredible. The thing that we would really like to do is to...if we want to take AI into the point of where people are and into this next phase of its journey, AI has to learn the laws of physics. Many of the world's challenges — whether it's climate science or autonomous vehicles or manufacturing or whatever it is — the AI can't just make a prediction. It has to make a prediction that obeys the laws of physics, conservation of matter, conservation of energy, and such. It has to understand the concept of synchronous time. It has to be working within our time. There are a lot of these types of problems that are really impossible, to develop that AI unless we have something that is essentially a virtual world that obeys the laws of physics. Which is the reason why we built Omniverse. We built Omniverse so that several things could happen. It's physically based. It's distributed. It's very large. It has the ability to support very large models. And the goal is several fold. One, you could teach a robot how to be a well-functioning robot in this physically based environment. You could connect it to IOT systems, for example, running a robot hardware in the loop. It has the ability to be connected to the physical world and stay synchronized, meaning to build a digital twin. The concept of a digital twin has been around for some time, but in combination with artificial intelligence, the digital twin is going to have a profound impact on the future. So, I'm super excited about these areas. One is just the application, and then the other's the next phase of AI. That's what Omniverse is all about. Lukas: Yeah. I totally agree that things like Omniverse is really critical for making robotics work. It sounds like you're interested in getting your company closer to the applications of AI, is that right? Jensen: We'll stay a couple of clicks away from the actual application. But what we would do is we would create an application framework for people who are building applications to build applications. One of the application frameworks that I'm really excited about created a little demo. They called it toy Jensen, at the last (GTC) keynote. Basically, it's a robot. But it's a virtual robot, otherwise known as an avatar. It has computer vision, it has speech AI and understands language. So on and so forth. I'm super excited about that because in the future, many applications...we really need to go into the application to experience it, whether it's a virtual factory or virtual hospital or what not. It could be for entertainment, like the metaverse and the next era of the internet. You want to go into that world. And the way to go into that world is through a wormhole called VR. We can go into that world. But we could also have those agents come out of that world and collaborate with us. They would come out through the wormhole called AR, and be in our world. But otherwise, the metaverse is enjoyed using my favorite display, which is a computer display. 
People think that you need to wear head-mounted displays for the metaverse, but it's furthest from the truth. The metaverse will be enjoyed largely on 2D displays. Lukas: Interesting. Well, we always end with two questions that I want to make sure that I get them in. The second-last question — and you've touched on some of these topics, but I'm curious — when you look at machine learning, do you feel like there's a question that's underexplored? Like you would recommend to a grad student to look into, or if you had more time you'd like to spend some more time investigating? Jensen: Well, some of the research work that's being done right now — there's so many smart people working on it because it's really important — the self-supervised learning approaches that are multi-modality...Lukas, that's going to drive the living daylights out of the platform you're building and the platforms we're building. Multimodality AI, where you have vision — and the vision doesn't have to just be images. It could be video, speech, and natural language — that's going to take perception to a brand new level. I'm super excited about that. I'm excited about zero-shot learning. To be able to learn from whatever you're trained on, plus the priors that you have, is really quite exciting and powerful. I think that one of the areas that is being explored now is to project the framework of graphs into the framework of deep learning. Or, graph neural networks. Graph neural networks...graphs, the relationship of things, is basically a structure that can describe almost everything meaningful in life. That's why it's so useful. Lukas: Totally. Jensen: But the processing of graphs is cumbersome. The breakthroughs with DGL, and GNN, and geometric, and all of that to project the graph into the framework, the constructs of a deep learning pipeline, puts it into our world where deep learning has been so effective. I'm excited about that, and I hope that a lot more people do that work. Lastly, I think there will be more innovation and more design and more creativity that's going to be done in the virtual world, than all of the creativity and design that has ever been done in the physical world. What people call the metaverse is going to be just brand new ground for manufacturing, for design, for artists, for entertainment of all kinds. I'm super excited about that. I mean, there's so many things to work on. Lukas: Awesome. That was a great answer. Our final question, in the last few minutes we have...there's this trope that machine learning, especially deep learning, projects almost never see the light of day. That they're way harder to manage than traditional engineering. I'm curious. When you look across your customer base, what are the most common issues that prevent machine learning from really solving the problems that customers actually have? Jensen: Yeah, this is really great. It's a great question, and it's also one of the things I love the most about your company and the way you think about this. There's a fundamental difference between the technology of deep learning and the harnessing of deep learning and machine learning to write software. The importance of the methods and the process and the tools, that is so vital. What could be described as MLOps, so vital. 
You have to understand not just the neural network architecture — and to be able to invent something that produces excellent results is of course groundbreaking work there already by itself — but a company, in order to take advantage of this, has to realize that in the final analysis, this is an intelligence factory. You have to think of it like a factory. That's the reason why the word ""ops"" makes sense. It's a factory. You have the raw material coming in, which is the data. It gets transformed in the middle, through a lot of stages of very complicated transformation. Which is one of the reasons why your tools are so popular. It's really complicated stuff. To manage that workflow in a productive way and transform that raw material into ultimately an output that is a neural network or otherwise intelligence-at-scale is quite a significant process. It's a fundamentally new way of thinking about computer science. We used to have just engineers do it. I don't mean ""just"" in that way, but we had engineers do it. But now we have engineers backed up by giant supercomputers that are operating these incredible operations — software stack — that you build. The refining process, the continuous refining process, the validation process, the simulation process...that entire process had to be reinvented for machine learning, reinvented for deep learning. This is the reason why your work is so important. You guys are doing a great job. I really appreciate the work that you do, and all the researchers that you support, and all the workflows that you are making possible. This is what every company needs to understand. That software development in the future is a bit of a refinery process. It's a refinement process. It's an MLOps process. It's, you know, manufacturing. Lukas: Well, thanks so much. That's really kind of you and I'm touched. I appreciate it. Jensen: Keep up the great work. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work were really hard to produce. So check it out.",7079 +Peter & Boris — Fine-tuning OpenAI's GPT-3,https://www.youtube.com/watch?v=CitdnuOGK48,2619,2022-02-10,"Peter: We have these kind of two camps of users. The researchers and the developers. Developers keep telling us like, ""Hey, I just want one button. I just want the best model to come out."" And then a lot of the researchers want to fiddle more with the parameters. I think we can probably satisfy both for a long time. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Today, I'm talking with Peter Welinder, longtime friend and currently VP of Product and Partnerships at OpenAI, running GPT-3 and other things. Before that, Research Lead at OpenAI, where he was one of Weights & Biases' very first customers. And before that, Head of Machine Learning at Dropbox. I'm also talking with Boris Dayma, Machine Learning Engineer at Weights & Biases, and we're going to talk about GPT-3 and the recently announced integration that GPT-3 did with Weights & Biases. This should be a lot of fun. Lukas: Peter, the last time we talked I think you were working on research at OpenAI. That's most of the time that I've known you, but now we find that you're VP of Product and Partnerships at OpenAI. I'm kind of curious what that means and what you're doing day to day. 
Peter: Yeah, sure. What I do today is quite different from when I did research, for sure. For me, doing research has always been about solving the hardest problems that are out there, in order to actually have some sort of impact on the world. I'm kind of personally much more driven by the end goals of research rather than the research itself. It's really fun to do research, you know, go down and explore things research-wise, but it's always been with some goal at the end of it. One exciting thing that has happened with GPT-3...a lot of the things that I did when I started at OpenAI was like, I did things on the robotics side. With robotics, there's still some gap from the stuff you can do in the lab and what you can do in the real world. With GPT-3, when we got our first results in GPT-3, it was kind of clear that we had something that we could start applying to real-world problems rather than just do cool demos. When I worked in robotics, what we got at the end was a really cool demo of a robotic hand solving a Rubik's cube, but it's not like you start deploying this in everybody's home. Even if it worked robustly enough to do that, I don't know how useful it would be to solve a Rubik's cube. It's a very expensive way of doing that. But with GPT-3, we had a language model that you can now apply to solve all kinds of different problems. Everything from translations to summarization to things like classification and question answering and so on. It was a very flexible model. So, what we set out to do was to kind of start just seeing if this was good enough of a model to actually solve real-world problems. For me, that's just a really fun area to focus on. When you have this really powerful new technology that has the potential of just changing a lot of things in the way they work, it's all about finding the right problems to go after. And then seeing how you take the tools you have in your toolbox to solve those problems. The difference is that what I did as a researcher was very much kind of coming up with the right kind of benchmarks and the right ways to measure progress, where there was a goal that was really far out and you needed to come up with these toy ways of evaluating progress. And now it's like customers telling us like, ""Hey, I'm trying to apply GPT-3 to this use case,"" and it doesn't work or it's too slow or something like that. Those problems are much more concrete. My day-to-day...right now, it's much more around building a team that can solve these real-world problems with the technology that we have developed at OpenAI. Lukas: When you look at GPT-3 versus the other approaches for large language models out there — that kind of seems to be a trend — are there key differences that you notice in how it works? Is the take different somehow? Peter: Yeah, it's a good question. I think that what I really like about GPT-3 — and the main way in my mind that it's different — is that it's just extremely simple. All that GPT-3 does... So, GPT-3 is a large language model, big neural network. It's using this Transformer architecture that Google introduced a couple of years ago that has been really popular. It's basically powering all different language models these days, and it's starting to make its way into other areas like computer vision as well. But the way GPT-3 is set up, it's very simple. It has some context, which basically means it has...you can look at a history of texts. Like, if you're reading a book, you can look at the page of texts or the paragraph of text. 
And then it's trying to predict the next word. That's the way that GPT-3 is trained. It's just trained on lots of texts from lots of different sources, mostly from the internet. It's just trained to kind of over and over again, based on some words it's seen, predict the next word. You can start with only a few words, but when we train these models today, we train them on the order of like a thousand or a few thousand words. You can look back at those thousand words and then try to predict the next word. So the setup is super, super simple and you just train it on these huge datasets of texts in order to keep on predicting the next word and get really, really good at that. I think the surprising thing with GPT-3 was that if you do that, and then you make the model really, really large — so it has a huge capacity of learning — then it gets really good at a bunch of tasks for which you previously needed specialized models. If you wanted to do a translation, you would need a specialized kind of translation neural network. Or if you wanted to do summarization, similarly you would set up your network in a particular way, and then train it on only summarization tasks. What we found with GPT-3 is that you actually get very close to state-of-the-art performance on a number of these benchmarks — that measure things like summarization, translation, question answering, and so on — with a model that has just been trained on the Internet to not do any of those tasks specifically, but by just being able to reproduce text in a similar way that it has read it. Lukas: Practically though, how do you apply it to a translation task? How do you take ""predicting the next word"" and make it do a translation? Peter: Yeah, that's a great question. In a lot of those other large language models, there are certain steps where you would take a piece of text and you would encode it. So you would create some representation in your neural network, and then you would have a decoder that would take that and then write some sentence. If you did translation, for example, you would encode that into some sort of representation, and then you would have a separate piece of your neural network that took that representation and tried to output what you wanted. The input might be like a sentence in German and output might be a sentence in English. And, you know, it's been trained specifically for that. For GPT-3, to your question then, what do you do with GPT-3? The simplest way you would do it is that you would provide a few examples of what translations might look like, in just pure text. You would write, ""German:"" and some sentence in German, and then ""English:"" and some sentence in English. You could provide only a single one, then the serve is called one-shot. You can provide a few examples of basically, ""German: English:"" examples, and then you would put in the new sentence that you would want to translate. That's called few-shot training, where you have a few examples and the model would, just by looking at the pattern of what it's now seeing in its context, it can predict...it can produce a translation. It's a very simple set up. Basically, the way I think about telling GPT what to do is a little bit like how you would actually tell a human to do the same thing. 
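To make the few-shot setup Peter describes concrete, here is a minimal sketch of what such a prompt could look like through the openai Python package of that era. The engine name, the example sentences, and the parameter values are illustrative assumptions, not details taken from the conversation.

# Minimal sketch of the few-shot translation pattern described above.
# Assumes the v0.x openai Python package and an API key configured in the
# environment; 'davinci' and the parameter values are illustrative only.
import openai

prompt = (
    'German: Das Wetter ist heute schön.\n'
    'English: The weather is nice today.\n'
    'German: Ich habe morgen einen Termin.\n'
    'English: I have an appointment tomorrow.\n'
    'German: Wo ist der Bahnhof?\n'
    'English:'
)

response = openai.Completion.create(
    engine='davinci',   # base GPT-3 model
    prompt=prompt,
    max_tokens=40,
    temperature=0.3,
    stop='\n',          # stop at the end of the translated line
)
print(response.choices[0].text.strip())

Dropping the example pairs and keeping only the final line would be the zero-shot variant; keeping a single pair is the one-shot case mentioned above.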
Like, if you're writing an email...if I'm writing an email to you, ""Hey Lukas, I want you to translate some sentences,"" what I would do is like, I would just ask you, ""Please translate these sentences."" And I would maybe provide a few examples to give you a sense of the tone. Like, do I want a more formal translation, more casual translation, and so on. You would pick up on the pattern. Given then a sentence in German, you — I don't know if you know German — you will be able to translate it to English. It turns out now with our latest models, you don't actually even have to provide those examples. You can often just ask the models just as you would ask a human. Like, ""Hey, translate this sentence to me,"" or ""Summarize these piece of texts"". We just found that that's how people wanted to use the models. We made them more work like that, but that's how simple it is. You just tell it what you want to do and it will do its best attempt at just doing it. Lukas: Did you make a concerted effort to train the model on multiple languages or was it mostly English? Where did the corpus come from? Peter: We actually did the opposite. Initially, when we trained GPT-3, we made a concerted effort not to train it on other languages than English. It turns out that even though these models are huge, there's a trade-off in your dataset mix. If you train it on English, but then lots of other languages, it would just not end up being as good at English tasks. And ultimately when we train this, we want to see, just generally, how good can it be at more general capabilities? We didn't care as much about translation. So whenever we put in extra languages, that would just be at the cost of being good at performing other tasks in English, like question answering, and summarization, and so on. But it turned out even by explicitly trying to filter out most other languages, probably a few small percentage points of the data turned out to be in other languages. And even with that, the model is just incredibly good at translation. It's close to state-of-the-art in a lot of translation tasks. I'm a native Swedish speaker, but I've lost my ability to write things in Swedish these days because I never do it. What I do these days is, I write it in English and I ask GPT-3 to come translate it to me. That's usually my point. It won't get it perfect, I need to fiddle with a few things, but it's surprisingly good. And the amount of Swedish training data in the model was really, really small. We've been constantly updating our models and making them better and better, so now we are introducing more and more language data, as we kind of figured out how to make these trade-offs in more optimized ways. But yeah, originally we actually wanted the opposite. We just wanted to be really good at English. Lukas: Is it predicting words or is it predicting one character at a time? How does that work? Peter: It's neither of those. It's actually predicting something called tokens, which is like....""part of words"" is maybe the way to think about it. The most common English words, they are captured by a single token. A token, it's basically...in our current set up, we have about 50,000 of these tokens and we map them onto sequences of characters. It ends up being like...a common word like ""hi"" or ""the"" ends up being one token. But if you have a more uncommon word, like ""encyclopedia"" or something, you're probably going to break it up into two or three tokens. 
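As a rough illustration of the token behavior described here, the byte-pair-encoding tokenizer that ships with GPT-2 in the Hugging Face transformers library can serve as a stand-in, since GPT-3 uses a very similar vocabulary of roughly 50,000 tokens. This is an approximation for intuition, not the exact tokenizer behind the API.

# Stand-in illustration of the word-piece behavior described above, using
# the GPT-2 BPE tokenizer from Hugging Face transformers.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained('gpt2')
for word in ['the', 'hi', 'encyclopedia', 'Verschlüsselung']:
    pieces = tok.tokenize(word)
    print(word, len(pieces), pieces)
# Common English words map to a single token, while rarer or non-English
# words get split into several pieces, which is why they cost more.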
It's like word pieces that just make it easier and more efficient for these language models to consume texts. In principle, you can actually do it at the character level as well. It just gets very inefficient. But you know, that's where the field is probably moving. Eventually it's going to just do that at the character level. Lukas: But I would think that might make foreign languages really hard. Like, for example, would Asian languages be impossible then? If they have far more tokens. Or I guess maybe you could argue they've sort of done the tokenization for you by having a larger number of characters that encode a bigger chunk of meaning. Peter: Yeah, it is definitely the case that the way you train your tokenizer would have an impact on the performance of different languages. Usually those two things have trained in two different steps. You would train your tokenizer on some corpus of data, and then you would separately train your models with that tokenizer on some other datasets. And in order to get your model really good at different languages, you need to train the tokenizer as well over multiple languages. It's definitely...it's more expensive to use other languages because they end up...a German word just ends up being more tokens because we've trained on much less of it, while English is very efficient, where a lot of words are a single token. So it makes it both a little bit worse at other languages and more expensive. Lukas: Could I translate something into Japanese? Would that even be possible for GPT-3? Peter: Oh yeah. One comment I remember was a Japanese user of ours. They really liked to use GPT-3 to translate technical documentation between English and Japanese, because they found that GPT-3 was much better at this translation of technical documentation than Google Translate. This was like a year back, so it's possible that Google Translate is better now. But probably just a chance thing based on the datasets that we had. The really cool thing, actually, with the translation capabilities of GPT-3 is that we haven't really trained the model on explicit pairs of input and output, translated pieces of texts. Like what you usually call ""aligned pieces of text"". It's just seen a lot of Japanese. It's seen a lot of...well, not super much. It's seen a bunch of Japanese, but a whole ton of English. Somehow, through learning how to predict the next word, there's been enough little pieces of texts, blog posts, or whatever — where the author is switching between Japanese and English and maybe doing like some translation on some sentences — where it found the mapping and then somehow has a representation that's good enough then to generalize to arbitrary translation tasks. For me, that's just magical. That it's just by reading lots of English text, lots of Japanese text, and then maybe like accidentally finding a few kind of aligned pairs in all of the data, it's able to do that translation. That's pretty crazy to me. Lukas: That is really amazing. Is this performance tangibly different than earlier versions of GPT? Like, was there something that happened in GPT-3, where OpenAI thought, ""Okay, we can use this for real-world commercial applications""? Was it a performance level that it needed to get above? Peter: Yeah, definitely. I think the big difference between GPT-2 and GPT-3 was really...it was trained on more data and it was a bigger model. Like by two orders of magnitude. 
I think the original GPT-2 was about 1.5 billion parameters and GPT-3, the biggest model, was 175 billion parameters. It went up by two orders of magnitude, and since it was much bigger model, it also needed more data. The surprising thing is that that's what it took to go from feeling fairly kind of dumb to interact with...like GPT-2 was kind of cool, but it also felt kind of incredibly stupid most of the time. I think with GPT-3, you went to being sometimes just surprisingly good. Don't get me wrong, GPT-3 does a lot of silly mistakes still. But it does the right thing probably like 30-50% of the time on some tasks, and sometimes even better than that. It's sort of like suddenly...before you would need to sample and try out tasks and maybe once every 20 or something you would see like, ""Oh, this looks pretty good."" And with GPT-3, it started happening like every third time, or every second time, or every fifth time. And you're like, ""Oh my God, this is actually..."" For things like summarizing text...one example we have is summarizing a piece of text in the style of a second grader. It's just incredible how the model is able to simplify words, get the gist of a piece of text, and so on. Again, it's not perfect, but it's just really good. Obviously, we have...there's a lot of academic benchmarks. You can run these models and you can see it's just getting much better on those academic benchmarks. But it was a whole different feel to it when you wanted to prototype something. The difference is that now it's just easy to get something that works pretty well. That's sort of why we decided like, ""Hey, now it seems useful."" GPT-2 didn't seem really useful to the same extent, but GPT-3, for all these tasks we felt like, ""Okay, it's close enough to state-of-the-art.""If you have a specialized model or whatever, a clever programmer should be able to apply it to whatever tasks they have. That was what we set up to validate with the API. Lukas: What are some of the use cases that you feel really proud of, where it really works? Are there any that you could point us to, where we could go interact with it in a commercial setting somewhere? Peter: Yeah, sure. I think some of the areas where we were most surprised were copywriting and question answering. Generally, creative writing. For copywriting, what happened there was that there was a number of companies that started building on top of our platform. Some of these companies are like...Copysmith was one of the first ones; CopyAI; there was also Jarvis, I think recently they changed their name to a different name; and a number of other of these companies. What they did was really clever, because they realized that — as I said — when you're using GPT-3 to do some task, it's not perfect. Every now and then, you would get something that doesn't really make sense. But if you're doing copywriting tasks, like if you want to write some engaging product description based on some attributes of a product — like a shoe, maybe the type of sole, the color, some other attributes of the shoe — and you want to write something really engaging about that, then the problem that you as a human face is that you get into some kind of writer's block. Like, where do I even start? What these companies started doing is they took GPT-3, and they used it to generate a few starting points or a few variations of how you could write product descriptions. 
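A sketch of that pattern, generating several candidate drafts from one prompt and letting a person pick the best one, might look like the following. The product attributes, engine name, and parameter values are made up for illustration.

# Sketch of the copywriting pattern described above: ask for several
# completions of the same prompt and let a human choose a starting point.
import openai

prompt = (
    'Write an engaging product description.\n'
    'Product: running shoe\n'
    'Attributes: lightweight foam sole, bright orange, breathable mesh\n'
    'Description:'
)

response = openai.Completion.create(
    engine='davinci',
    prompt=prompt,
    max_tokens=80,
    temperature=0.9,   # higher temperature gives more varied drafts
    n=5,               # five candidate drafts to choose from
)
for i, choice in enumerate(response.choices):
    print(f'--- draft {i + 1} ---')
    print(choice.text.strip())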
What you find is more often than not, if you generate like five of those examples, one of them would look really good and you can use that as your starting point. You maybe just take it as it is, or you make some small tweaks to it. It's a way to almost aid in human creativity, you know? I think that's just so cool. Writers would tell us like, ""Hey, I've been trying to write this book for like half a year now. I just keep on getting stuck in writer's block. Then I started using a playground for GPT-3, and now it took me two weeks to turn out the whole book."" When you get stuck, it can create an interesting storyline. As a creative writer, you start exploring that like, ""Okay. I wouldn't have thought of this character going down in that direction, but let's explore that."" And then it becomes a much more fun, engaging process. It's almost like as a human, now we have a brainstorming partner that you can apply to all these different tasks. I think what I found was really cool is to see a number of companies really leveraging that and creating new experiences that you couldn't do before. I think that one is really exciting. I think question answering is also really cool, but this one was quite unexpected. I don't think we would have predicted that one being such a big use case. Lukas: It seems like one of the advantages of GPT-3 is that it works right out of the box. But I could also imagine for some teams there might be a concern about what do you do if something goes wrong. I guess I'm curious. Do you typically work with ML teams inside of companies, or is it more engineers that view the benefit here as that they don't have to figure out how machine learning works to get the benefit of natural language processing? Or do you tend to integrate this with ML teams into a kind of bigger ML workflow? Peter: Yeah, that's a good question. It's a bit of a mix, I would say. We've had multiple machine learning teams who already had their own models that...they would have downloaded the models online, and so on, and they would have adapted them for the tasks. And then they find our API and start doing the same thing using our API, and it just turns out that you can get much better performance from our models. Like, just because there doesn't exist... there isn't an open source version of the biggest models that we have, or the best models. So for a lot of tasks, that's what works the best. But I think probably the majority of our customers are more in the other camp of just ""really smart developers"". When I say ""developers"", it's pretty broad a group. We see everything from programmers and engineers, to designers and PMs. A number of people have told us that the OpenAI API was sort of what got them into programming, because they got really good results from just our playground, where you can interact with our models. They got ideas, and they started to learn how to code, and got connected with no-code tools like Bubble IO and stuff like that. It's really lowered that barrier. You don't have to become a machine learning expert to get really good results out of these models. You just kind of have to be good at iterating and figuring out how to write the instructions to the model. It's a little bit like...everybody becomes a manager. You have to give really good instructions to your employee if you want them to do the task as you want it to be done. It's very similar with these models. Like, if you underspecify your tasks, you're going to get very high variance in the outputs. 
But if you get really good at specifying — even providing a few examples — then you get really good results. That's not a machine learning skill, that's almost more of a = task specification, management skill. I feel like a lot of people can pick that up really quickly. I've been really excited about that, just seeing so many people get access to these models that just seemed like you had to have a PhD in machine learning to work with before. Lukas: I feel like I've heard of people talk about a new role called ""Prompt Engineer"" that might be related to this. Figuring out how to prompt GPT-3 to get it to do what you want it to do. Peter: This one is interesting because...early on when we had the first version of the API, we had a really smart guy who is a world-renowned author, but also a programmer; Andrew Mayne. He was one of the early users of the API and he got the internal name of ""the prompt whisperer,"" or ""GPT-3 whisperer"". He really knew how to craft the prompts to get the best results. Since it's been trained on the internet, you kind of need to put your mind in like, ""How would the text in the internet start?"" If you wanted a really good recipe, you have to start writing in the tone of a recipe book or a food blog post or something like that. It's not like you could just ask the model to do what you wanted it to do. I think, initially, there was a big piece to that. You really had to be good at understanding the intricacies of GPT-3 and design really good prompts. Over the past one and a half years since we launched, we saw people struggling with this a lot, so we developed a new set of models. We call it InstructGPT, which actually just like last week became the default in our API. The reason we're calling it InstructGPT is because you just provide instructions. So I would say prompt design is a little bit less of a thing now. You could just tell the model what you want it to do and provide a few examples. There's still a little thing about...the formatting might impact how you provide your examples and so on. GPT-3 is super robust to that, but sometimes it does matter a little bit. Some tweaking matters. But I would say it's less of a thing now than it was a year ago. And my hope is that it becomes less and less of a thing, and it becomes much more interactive. Lukas: You've also launched the ability to fine-tune the models. What's the thinking there and where's that useful? Peter: The surprising thing with GPT-3 was that you got really good results zero-shot, where you only provided an example...no example, just the instructions of like, ""Hey, translate this sentence from German to English."" Or you provided few-shot examples, where you provide a few pairs of German and English. With just a few-shot examples, you could get surprisingly good results. But what that meant in practice is that...the accuracies are very task-dependent. For some tasks, maybe 30% of the time you got to an output that was kind of acceptable to put in a product. And then for other tasks that were more simple, you'll get it like maybe 70% of the time. When it's not good every time, you have to be very clever in the way you can expose it in your product. That's why, for example, it worked well for a lot of those copywriting companies. You could just provide a few examples and you kind of knew that at least one of them would be good, and that's all the user needs. But with fine-tuning, what you can do is basically...you can customize your model. 
You can provide it more examples of the inputs and outputs you want it to do. If you want to do translation, or if you want to summarize articles, you can provide a few hundred examples of articles that have done human-written summaries, and you can actually update GPT-3 to do much better at that task. You couldn't put all those examples in your prompt. The prompt has limited space. But with fine-tuning, you're working these examples into the connections of this neural network, into the weights of the neural network. In some way you have like an infinite prompt. You can provide as many examples you want. Obviously, the more examples, the longer it will take to fine-tune and the more costly it will be. But fine-tuning is basically that concept of taking a bunch of input and output examples, and kind of working them into the model, and getting a new version of the model out that's really good at that task for which you provided examples. It turns out with only a few hundred examples — or around a hundred examples — you can get significant boosts in accuracy. We have a number of customers that have used it. Like Keeper Tax, they're analyzing transactions to find these tax write-offs and stuff like that. What they're doing is they're extracting the relevant pieces of texts, they're classifying, and so on. They fine-tuned models and got much, much better results with fine-tuned models, for example. We've seen that over and over again with our customers. They can get really good results that can often be good enough for a prototype, but then in order to get it to high enough accuracy to put it in production — which is usually more than 90% or 95 or 99% — fine-tuning on some datasets that they have, or they put together, gets them all the way. That's enabled many more applications than you could do before. So we just made it very simple to do this kind of fine-tuning. Lukas: Cool. And you know, I have to ask you about the Weights & Biases integration. I mean, we're so excited about it. I don't know if people listening would know that you used Weights & Biases from the very early days and provided a ton of incredibly useful feedback that's in the product. But I was curious how you thought about how that integration might be useful for users of GPT-3. Peter: So, this is the background of my usage of Weights & Biases. I was one of the first users and it just improved my research workflow so much that I'm a big Weights & Biases spokesperson now. Basically what it does, right, is that it allows you to track your experiments in a really nice way. As you're training your models, you can get all the stats. Anybody who has trained machine learning models knows that you have to look at a bunch of curves as you're doing your training, to make sure that the models are learning in the way that you want. A lot of the work you do as a machine learning engineer is to do that sort of iteration on your models and see if you can improve your results. And a lot of that is looking at those learning graphs and on. It's really good because Weights & Biases provides you with this history of the experiments you've run. They let you compare experiments and let you track your progress and share it with your teammates and so on. What we did is basically make an integration, so that as you're fine-tuning your models — your GPT models — via our API, all your experiments, all your training runs show up in the Weights & Biases interface. You get that same convenience, but now for things that are training in our clusters. 
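A minimal sketch of that fine-tuning flow, as it looked with the v0.x openai Python package, might be: upload a JSONL file of prompt and completion pairs, then start a fine-tune from a base model. The file name, base model choice, and hyperparameter values are assumptions for illustration.

# Sketch of the fine-tuning flow described above (v0.x openai package).
import openai

# train.jsonl holds one JSON object per line with prompt and completion fields.
upload = openai.File.create(file=open('train.jsonl', 'rb'), purpose='fine-tune')

job = openai.FineTune.create(
    training_file=upload.id,
    model='curie',   # a smaller, cheaper base model than davinci
    n_epochs=4,      # a few passes over the data, as discussed later
)
print(job.id)

Once the job is running, the one-line sync command Boris describes below pushes these training runs into a Weights & Biases dashboard.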
You can see as our fine-tuning process is happening — as the model is updating its weights based on each new iteration of going through the dataset — you can see your metrics, and so on, improve. You can also...we provide a number of different parameters, so it lets you iterate and try out different parameters and see your progress. It's just much more delightful to train your models that way, to have that place where you can go and look at your results in an ongoing way. That was a super exciting integration for us. It lets you keep track of all your fine-tunes in a much better way than...we have a command line interface, it's not at all as pretty as the Weights & Biases way of tracking things. Lukas: Boris, you actually said you did the integration. You said it was one line, is that right? I mean, my question for you is more how you thought about how it might be used, but I'm curious. Was it really a one-line integration? Boris: There's a few more in the code, but the way for the user is just to type a line, to type ""openai wandb sync"", and it can automatically sync all these runs to a dashboard. The idea was that there's a lot of people who use the API that are not ML engineers, so you don't want them to have to learn, ""Okay. What am I supposed to log? How do I take care of a data set?"" The OpenAI API, it was so convenient. When you want to train a model, you just pass a file that is your dataset and it cleans up the dataset, and then you pass a new command and it fine-tunes everything. It was a bit the idea of keeping the same simplicity. You will just type that one command, and then all the magic happens behind the scene. You have all your visuals and you can compare your models and see, ""Is it worth giving more training samples? How much did my model improve from that? What is the effect of tweaking that little parameter here? What dataset did I have when I trained that model?"" It's trying to make it as easy as possible for users to benefit from all the features when they don't necessarily know Weights & Biases initially. Lukas: I guess for both of you, what are the parameters that you can actually tweak? Because the way you've described it, it sounds to me like there might not be any parameters. How do parameters get involved here? Peter: Before I answer that question, one thing that Boris said that really stands out to me...why I really liked this integration generally was that there there is this concept of just making these advanced things very simple. I still remember when Lukas, you, Shawn, and Chris did the first Weights & Biases demo. It was basically just like ""import wandb"" to just start logging an experiment. I think that philosophy of just making it super simple to get going is something we have tried to also do in our API. You ""import openai"" and then like a single API call or Python or JavaScript gets you to use GPT-3 and start creating completions and stuff. I really liked that simplicity, and that's what we tried to do within this integration. But, to your question about the kind of parameters, we tried to make this quite simple in our API. We tried to make the defaults very, very good. Generally, you can get really good results with fine-tuning without fiddling much with our parameters at all, but some make more of a difference. You can set, for example, the learning rate. That's how much you're updating the weights with each learning step. You can set things like how many passes you want to go through the data. 
It turns out if you go through the data too many times, then you're going to overfit on your data set. These GPT-3 models being really big, you often only need on the order of two to five iterations through your data to get really good results. If you go further than that, you sometimes overfit. There are more advanced parameters as well, but I kind of feel like playing a bit with the number of epochs you want to train it for and their learning rate, that gets you 90% of the way there. If you start fiddling with other parameters, it's not going to give you that much more. Lukas: Was part of the thinking of leaving the parameters in to just give the person...you can get the joy of messing with parameters? Peter: Honestly, I would love it if it was completely automatic. That said, we do have a number of more research-oriented customers who really do like the fiddling. So I think it would be hard for us to remove it. But, as I said, we have these kind of two camps of users. The researchers and the developers. Developers keep telling us like, ""Hey, I just want one button. I just want the best model to come out."" And then a lot of the researchers want to fiddle more with the parameters. I think we can probably satisfy both for a long time. Lukas: Boris, I don't know which category you put yourself in. You make some amazing, beautiful demos, and you also love to tweak parameters. I'm curious your experience playing with the GPT-3 model. Boris: I definitely like having a good default, because initially you don't really know what you should change on it. Let's say you would choose the wrong parameter and nothing works. It wouldn't be a nice experience. So I like that if you don't choose anything, it's already going to be pretty good. Then, I really like to tweak the parameters to see, ""Okay, what would be the effect?"" and try to play with intuition. In addition to the parameters that Peter mentioned, there's two that interest me a lot too. You can decide which model you fine-tune. There's models of different sizes. If you use a larger model, maybe your API is going to be a bit slower but your [?] will be better. Maybe sometimes you don't need it, maybe sometimes indeed. So I like to see the effect of which model I use. I like to also see the effect of ""How many training samples can I give?"". Like if I give only 20 samples, versus giving 100 or 200. Because then it gives you an idea on how much my model is going to be better as I develop a larger data set. There's all kinds of parameters I like to play with and see what are the predictions based on these. Peter: Yeah, that last one, it's actually super important. I think it's one of the most common advice we give people over and over again. It's like, start with a small set of examples, then double it and see how much of improvement you get. You usually...if you double your amount of training data, then you get to see some linear improvement in your error rates. So if you have 10% error rate or something, and you double your training data, you're going to get down to maybe 8% error rate. And then you double it again, you get down to 6% error rate, and so on. If you can start seeing that trend, then you can suddenly get a sense of, ""How much would it actually cost me — in terms of labeling more data and so on — to get the result that I want?"" and so on. It's a very powerful thing to do. Lukas: Are the results of training these models reproducible? How much variability is there each time you fine-tune it? 
Would you get the same model if you fine-tuned on the same data two different times? Peter: In principle, you can set it up to be quite reproducible. If you basically train it on the same date..basically what you want to do when you train, is on each train iteration you have a batch of data, like a number of examples. You can actually...the API can set the batch size, how many examples per update you want. I think it defaults to 32 or something like that. When you do that, you also want to shuffle the data. You want to take a random sample of your training data. As long as you keep those randomizations consistent between your training run, you're essentially gonna get the same model at the end of it. It's going to be fairly reproducible. The only caveat is that, in practice — this is true, even for inference. We have a parameter called temperature where you can set the variability in the output. Higher temperature, the more variability — even if you put a zero there's no real guarantee that you're going to get completely deterministic output. There's enough noise and a little weirdness with floating point arithmetic and so on in these GPUs with these really big models, that it's very hard to guarantee complete determinism. We get people asking about that a lot, and the answer is always like, ""Well, unfortunately we cannot provide that, but you can get something that's fairly [?]."" But you should just make your experiment robust enough that you don't really care too much about the determinism. Lukas: I would think, operationally, having everyone have their own fine-tuned model would be much more of an infrastructure challenge than everybody using the API that hits the same model. Has that been a big undertaking to allow that to happen? Like, do you have to swap in and out of the different models as people start to use them? Peter: Yeah, no, for sure. When we started out, the way we did fine-tuning was basically...in some way, you almost rented a set of GPUs where the models ran on. For some of the absolutely earliest fine-tuning customers, we essentially charged them by GPU hour, to some extent. Like per hour, how much they were using the models. Even from the very beginning — I think like within six months after launching the API, we had a few select customers that had fine-tuned models and stuff like that — that's sort of the way it worked. The problem with that is, if you're trying something new, GPU hours are expensive. You don't want to really pay to reserve a GPU for like even a fraction of an hour. It just adds up really, really quickly. We just set a goal of saying, ""Well, as soon as you have fine-tuned your model, you should immediately be able to just use that model, and you should just have to pay for basically the tokens that go into it at inference time."" Like, whatever you put in your prompt. That was definitely a huge engineering challenge to make that experience really great. You just kick off your fine-tune, and when it's done get a fine-tuned model name out. Now you can use that model in the API to just get a result immediately. And you're not going to be charged by hour or whatever, you're just going to be charged the same way you're going to be charged for the API. That was really tricky. We have an amazing engineering team at OpenAI that really figured out a lot of tricks around balancing where these models end up, and cacheing them in the right way, and so on, to create a great experience around that. 
Boris: I'm curious if you fine-tune the entire model or you fine-tune just part of it to make it more efficient. Peter: There's just lots of tricks that we're using to make this happen. We're constantly trying to figure out new ways of doing it. There are challenges if you want to fine-tune a whole 175 billion parameter model. It can get really expensive and hard and so on, and there are tricks you can do to make it much faster. Lukas: Do you feel like the thing between you and everyone using GPT-3 for natural language tasks is more quality and performance of the model itself? Or is it something else? Is it something about integration, or monitoring in production, or something like that? Peter: Definitely the key things we focused on when we built the API were...what matters the most is really the capability of the models. Then number two is like, you need to have fast inference. Before we created our API, for language models nobody cared about the inference. Everybody cared just about how quickly you can train them, because that's what mattered, you know? So you can get your benchmarks resolved at the end of the day. We did just a ton of engineering to make inference super, super fast. I can remember over the course of the first few months of us getting the first prototype of the API to a customer starting to use it, we increased the inference speed like 200-fold or something like that. A lot of effort went into making that super fast. The third thing is everything around safety. One of the reasons we invested in these InstructGPT models is that we saw that sometimes you can get surprising outputs from the models, outputs that you don't expect. For example, you might write a very innocent sentence and it might turn very dark for some reason, or you might get some biased outputs in different ways. With our instruct-oriented models, by default they behave in a much more expected way, but you can also specify the behavior in a much better way. It turns out when safety and capability come hand-in-hand...it just becomes a better product when you can control it better. Those are definitely the things we have focused on, and I think we're doing much better on those than the alternatives that are out there. But there's also...another thing that we have put a lot of focus on is just making it really simple to use. The fact that you don't have to load up models, that calling a fine-tuned model is just a single line of Python against the API, that's also been really central to us. We want this to be easy to use by everyone. Lukas: Awesome. Well, thank you very much. It's really nice to talk to you and congratulations on making such a successful product. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we worked really hard to produce. So check it out.",7766 +"Ion Stoica — Spark, Ray, and Enterprise Open Source",https://www.youtube.com/watch?v=-MVLURFH5nk,3222,2022-01-20,"Ion: When we looked and we thought about it, we couldn't see a path for the company to be successful — a credible one — without the open source being successful. And then once we reached that conclusion, we just...there was no other discussion, we just focused on that. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald.
Today, I'm talking to Ion Stoica, who is maybe best known as the original CEO of Databricks, the company behind Spark. But recently, he's also started another incredibly successful company called Anyscale, which makes the open-source project Ray. On top of all that, he's a professor at Berkeley, where he runs the fascinating and super successful RISELab which is responsible for many of the most exciting startups of the past decade. This is a super fun conversation, and I really hope you enjoy it. Lukas: I think a lot of people listening to this will know about Ray and know about Anyscale, but for someone who's working in machine learning doesn't know about Anyscale, what does it do? Ion: Fundamentally, if you look at the trends, the demands of the new kind of applications, like machine learning applications or data applications, are growing much faster than the capabilities of a single node or a single processor. This is even if you consider specialized hardware, like GPUs, TPUs and so forth. Therefore, it looks like there is no way to support these workloads, other than distributing these workloads. Now, writing distributed applications is hard. And if more and more applications are going to become distributed, there is an increased gap between the desire of people to scale up their workload by distributing it and the expertise the typical programmer has. So Ray — let me start with Ray before Databricks — the goal is to make writing distributed applications much easier. It's doing that both by presenting a very flexible and minimalist API, and in addition to that, we have this very strong ecosystem of libraries, distributed libraries. Many of the people in the audience probably know [them], like RLlib for reinforcement learning, Tune for hyperparameter tuning, more recently Serve. But also we have a lot of other third-party libraries like XGBoost, Horovod, and so forth. Because at the end of the day, if you look at the most popular languages like Java or Python, they are not the most successful because they're the best languages. That's debatable. They're very successful because they have a strong ecosystem of libraries. Developers love libraries because if you have libraries for your particular application or workload, you make a few API calls and you are done instead of writing a thousand lines of code. Now, this is Ray, it's open-source. Anyscale is a cloud offering, hosted offering, of Ray. We are committed to building the best platform to develop, deploy, and manage Ray applications. This means higher availability, better security, auto-scaling functionality, tools, monitoring when you deploy application in production. On the developer side, we try to provide the developer the illusion of the experience of an infinite laptop. Because still most of the developers — we've done this survey, and others have done surveys — most machine learning developers still are loving their laptop and they're still doing a lot of things on their laptop. We want to preserve the experience of working on a laptop using the same kind of tools like editors and things like that, but now we want to extend that to the cloud. So you edit, you do everything on your laptop, but then when you run it, you can run it in the cloud. We package the application, all the [mechanization] to the cloud. And we run on the cloud, we auto-scale, so it's pretty much transparent. This is what Anyscale provides. 
But both Anyscale and Ray are really targeting to make scaling applications, in particular machine learning applications, as easy as possible. Lukas: That's, I guess, very conceptually simple, but clearly it's been a problem for a very long time, and you've put a lot of work into Ray and Anyscale. What makes it actually challenging to make a simple distributed framework? Ion: That's a great question. One lesson we learned is that people and developers, what they really prioritize, it's in some sense performance and flexibility. Even over reliability. I'll give you some examples. When I started Ray, we had only tasks. They are side-effect free. Tasks get some inputs from some storage, compute on that input, and then the result is also stored in this kind of storage. And then another task can consume. Now, that's a very simple model, and you can build a very robust system on that. This is from the lessons we learned also in the past with Spark. Because if for instance you lose some data, you can keep the lineage; the chain of tasks which created that data in the first place. And then you can re-execute the task if you know the order. And if the tasks are effect-free and they are deterministic, you get the same output. We're pretty happy about that. But then people started to want more performance. Here are where things started to fall apart. For GPUs, you don't want to just run a task, get the data in, and store the data. Because even transferring the data from the RAM, from the memory of the computer, in the GPU memory, it's expensive. And then, if your task also is doing this like TensorFlow, starting it, initializing all the variables, it takes a few seconds at the least. A bit more actually. This kind of overhead was kind of starting to be prohibitive. People ask for, ""Okay, I want my state actually to remain on the GPU,"" but then you don't have these kind of pure tasks. And now, it's much harder to provide this very nice model of fault tolerance. And then there is another thing. Reinforcement learning. People using reinforcement learning, they wanted to use it now for simulations or for rollouts, games. Some games are not open-source, and for these games which are not open-source, they keep the state inside. It doesn't provide you the state. You cannot extract the internal state. You can only see the screen. They make you take an action — moving left, right — and then you look at the screen and you read the screen. So because of this, we have to get the actors. With actors, it was much harder to provide this fault tolerance. We still tried it, initially, in our first paper. We tried to be very smart. We said, ""Okay, it's Python,"" so we make the assumption that inside each of these kind of actors, you have a single thread, sequentially. Basically you can order the methods which are executed on the actor. You can then sequentialize. You have an order, you record the order, and then you can re-execute or reconstruct the state. But guess what? People started to use multi-threading. Even if it doesn't work greatly in Python, they still use it. You cannot stop them. Then, we were thinking, ""Okay, we are going to simplify it."" Let's simplify our life, because we still want to make a system which is kind of...we want to understand it better and we want to try to provide some fault tolerance. We had these restrictions that if you create an actor, only the party which creates the actor can invoke a method on that actor. 
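For readers who have not used Ray, here is a minimal sketch of the two primitives Ion is describing, a stateless remote task and a stateful actor, with the result handed between them by reference through the object store. The function and class here are invented for illustration.

# Minimal sketch of Ray's core primitives: remote tasks and actors.
import ray

ray.init()

@ray.remote
def preprocess(batch):
    # a side-effect-free task: input in, result out
    return [x * 2 for x in batch]

@ray.remote
class Counter:
    # an actor keeps state (e.g. a model loaded on a GPU) between calls
    def __init__(self):
        self.total = 0
    def add(self, values):
        self.total += sum(values)
        return self.total

ref = preprocess.remote([1, 2, 3])       # returns an object reference immediately
counter = Counter.remote()
print(ray.get(counter.add.remote(ref)))  # the reference is passed, not copied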
You have only one source, so at least it's easier to serialize the actions. But then, people started to want, ""Oh, I want to do something like [a] parameter server. And [for the] parameter server I want not only me to access this parameter server — which can be implemented as a bunch of actors — but others need to pass the actor handles."" But now you have...again, it's like this concurrency from different methods submitted by different actors or tasks. So all of these things add, in some sense, complexity. And then if you talk about the fault tolerance...I'm coming back to the fault tolerance, because still it's important, especially in a distributed system. Leslie Lamport — the guy who did Paxos, and [a] Turing Award winner — his definition, a long time ago, of a distributed system is a system in which when a machine or a service you never heard about fails, the system stops working. Then we had to give up our ideal of transparent fault tolerance. And we said, ""Okay, we can restore the actors, but then the application has to do some work in restoring the state if it cares about it."" In a distributed system, these are the hard things. It's the performance and fault tolerance. And then in general, concurrency is the other thing. Because things happen in parallel, and it's on different machines now. And again, when you expose that and you want to make it flexible, things are much harder. Because in something like Spark, you abstract away the parallelism. You don't give control to the user to write really parallel applications. So then you have more control. But again, the more you... Lukas: This does seem very similar to Spark in some ways. I assume that this was informed by your experience with Spark. Can you maybe talk about what was...maybe first describe Spark in contrast, and then talk about how that informed Ray? Ion: Totally. Spark was developed for data parallel applications. With Spark, you as a programmer, you see the control flow. It's sequential. You write like a program. The difference is that one of these instructions now, in Spark's API, under the hood is going to work on a dataset. And that dataset, whether it's — the first was Resilient Distributed Datasets, now it's Data Frame — is partitioned, with different partitions on different machines. So you have a dataset, and it's partitioned on different machines. And now you are going to execute a command on this dataset. And that command is going to execute in parallel on each partition under the hood. But when you write a program, you just operate on the data set, and you apply some function. The computation model, which is called bulk synchronous processing, is basically ""operate in stages."" In each stage, we have a bunch of basically identical computations operating on different partitions of the same data. Between stages, you exchange data. You can shuffle and so forth. You create another data set for the next stage to operate on. The basic stages are map and reduce. It's very synchronous. It's like, one stage operates on a data set, you do a shuffle to create another data set, you have another stage, and another stage, and another stage. For the programmer, you don't have control over the parallelism. Because you write one instruction, and the instruction, programmatically, is at a data set level. It's only under the hood that you take that instruction, or function, and you execute it on different partitions. This is great for data. And obviously Spark has a great API, a fantastic API, for the data.
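By contrast, the dataset-level, bulk-synchronous style Ion describes for Spark looks roughly like this in PySpark; the file name and column names are made up, and the point is only that each instruction operates on a partitioned dataset while the parallelism stays under the hood.

# Sketch of Spark's dataset-level programming style (PySpark).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('example').getOrCreate()

df = spark.read.csv('ratings.csv', header=True, inferSchema=True)  # partitioned dataset
per_user = (
    df.groupBy('user_id')                  # shuffle: data exchanged between stages
      .agg(F.avg('rating').alias('avg_rating'))
)
per_user.show()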
Now, Ray is much lower level. Ray exposes parallelism. Spark abstracts away parallelism, Ray exposes parallelism. So you can actually say, ""This task is going to operate on this data, and this task on this data. This is probably going to happen in parallel, and here are the dependencies between the outputs of these tasks."" And you have another task operating on these outputs from these different tasks. That gives you flexibility, but it's harder to program. On the other hand, in Spark — and in other systems — you have a master. This master is the only one which runs tasks because it launches all the tasks in a stage. For instance, in the case of Ray, a task can start other tasks or can start actors, and they can communicate between themselves. In the case of Spark, and the other BSP systems, the tasks in the same stage cannot communicate between each other. They just work on their partition, and then how the changes are propagated, you shuffle to create another data set for the next stage. But, for humans, it's hard to write parallel programs. We are used to thinking sequentially. Even context switching for humans is hard, and context switching by definition is not necessarily that you do things in parallel. It's multitasking — do a little bit of this, a little bit of that — and even that is hard. We are not used to thinking in parallel. This is difficult. This is hard. So that's another reason for the libraries, because the libraries on top of Ray, they also abstract away the parallelism. If you use RLlib, or if you use Tune, you don't know what is running where, and you don't need to worry about that. But that's kind of the thing. It's a much more flexible, lower-level API. You know, I joke that if Ray delivers on its promise, which I hope it does, and you developed Spark today, you'd develop Spark on top of Ray, and that's why you have others- Lukas: That was going to be my next question. That's great. Ion: Yes, yes. So, that's exactly the way it is. Ray fundamentally...another way to look at what it is, it's an RPC framework — Remote Procedure Calls — plus an actor framework, plus an object store which allows you to efficiently pass the data between different functions and actors by reference. That's what it is. Instead of always copying it, you just pass the references. That's it. That's where the flexibility comes from. Lukas: When you were working on Databricks or Spark, were there use cases that you were seeing that made you want to develop Ray? Or was it something that you always wanted to create? Ion: No, no, no. One thing happened, and I'm a believer in this. You should develop a new system only if existing systems do not provide the functionality you need. And before you develop this new system, you better try to implement what you want on the existing systems. Lukas: Sure. Ion: When we developed Ray, it was in 2015 I believe, in the fall. I taught a class, a graduate class. I was still the CEO of Databricks at that time. Robert and Philip took that class. It was a systems class, and they were machine learning students. Their project was about data parallel training. Obviously, I asked them, ""Okay, you use Spark for that. It's good."" So they did use Spark. Actually, they modified it a little bit, and they called the modification SparkNet. But then, there were a few challenges. Spark was too rigid. With reinforcement learning, the computation model you need is much more complex, so to speak. You need nested parallelism and things like that. Spark, again, was too rigid.
It was fantastic for data processing, but now you needed a lot more flexibility for something like reinforcement learning. It wasn't a good fit. And the other thing, Spark is in Java, the JVM, and Scala. Java didn't — at least at that time — have very good support for GPUs. That's why we started then to develop Ray. Robert and Philip developed something for themselves, to start. Lukas: That's great. I mean, I also would love to hear the story of Spark. I remember a time when Hadoop had the same value prop and everyone was really excited about it. It seemed like Spark replaced it in such a massive way, that I think you rarely see with technologies. I'd love to hear what the use case was that drove the development of Spark and why you think the switch happened so quickly. Ion: You know, that's a great question. The story there, it also started from a class project. This was in 2009, in spring. I was teaching this class — again, a graduate class — and it was cloud computing services and applications. Something like that. One of the projects there was to have cluster orchestration. The problem was you wanted the same cluster to be able to run multiple frameworks, to share the same cluster across different frameworks. One use case, it was actually upgrading. Hadoop at that time was not very backward compatible. If you had a new version, it was a big deal to upgrade. Most of the deployments were on-prem. So now it's hard on-prem to come up with another cluster to test the new version before you are going to move to the next version. Therefore, if you had the ability to run two Hadoop versions, side-by-side, on the same cluster, this would be much better, and a great value proposition at that time. Initially the system was called Nexus, but then someone from academia told us that this was a bad name because they already used the name. So it was a name conflict. So we changed it to Mesos. Maybe some of you remember Apache Mesos; that was a precursor of Kubernetes. On this project, there were four people. It was Matei Zaharia, Andy Konwinski, Ali Ghodsi, and Ben Hindman. With Mesos, one of the value propositions is, you have all these frameworks and it's now going to make it easier to build a new data framework on top. Because Mesos, well, it takes care of some isolation between the frameworks, and does some heavy lifting, like detecting failures, things like that. Doing some scheduling. You'll see that Spark, one of the reasons it was developed was as a showcase for Mesos. Because now it's easier, writing a few hundred lines of code, to develop a new framework like Spark, and run it on Mesos. So this was happening in mid-2009. So what were the use cases? The primary use case was machine learning. There's a great story there. That was RADlab, and then it was AMPlab, and then the RISElab. Each lab is almost like five years, where everyone...people from different disciplines were sitting together in the same open space, meaning machine learning, databases, systems people, all together. Around that time there was also this Netflix challenge, the prize, you remember? It was a $1 million prize for developing the best recommendation system. We had a postdoc, Lester, come to us, like, ""Okay, it's a lot of data. What do we keep, what should we do, what can we use? You are the system guys. Tell us what we should use."" Well, you should use Hadoop. We were working with Hadoop. So Lester went and used Hadoop; we showed him how to use it. But then he comes back to say, ""Well, this is super slow.
It analyzes big data, it doesn't run out of memory, but it's so slow."" And obviously it was slow because most machine learning algorithms are iterative in nature, right? You start, you ingest more data, you fit a model until you get a model whose accuracy you are satisfied with. It converges. Each of these iterations was translating into a MapReduce job. And each MapReduce job was reading and writing the data from the disk. And at that time, that meant slow disk drives. So it took forever. That was one of the use cases. The other use case was query processing. Also, at that time, what happened is everyone — at least some large companies — was adopting Hadoop to process a large amount of data. After all, it's MapReduce; Google was doing MapReduce, so it must be good. But now you have...also, these other people like database people, and they're looking at the queries, the data and so forth. But now you have all this other huge data with someone else, and they're asking for access to the data. We said, ""Okay, you get access to the data, the only thing you need to do is write this Java code, this MapReduce code, and you can process the data."" These people, that was not what they were doing. They were doing SQL, writing SQL statements. And then people started developing Hive — from Facebook, I think — and Pig Latin from Yahoo, a layer on top of Hadoop which provides some query language similar to SQL. So you get that, you have this system, now you can do the query on it. The problem when you do a query on that is that these people are coming from databases. They write a query, they get the answer. Here you write the query, well...come back in 2 hours to get some answer. So it was slow. So these are the use cases that Spark targeted. And the way it targeted that was keeping as much as possible of the data set in memory. The trick which Spark had at that time was not only to keep the data in memory, but how do you ensure resilience? Fault tolerance. Because that was a big deal. If you remember all of these things about...actually, even building big computers, clusters, from commodity servers, it's coming actually from Berkeley. It was a project which was called NOW, Network of Workstations, in the nineties. Before that, if you wanted a lot of power and so forth, you bought a supercomputer. But now you have these commodity servers, and guess what? They fail. So this kind of work was very ingrained. You need to provide fault tolerance. That's why Hadoop puts the data on the disk. If it's on the disk, hopefully it's durable because it's creating three copies of each piece of data. So you take care of that. But in the case of Spark, now you keep the data in memory. So how do we do fault tolerance? You do fault tolerance like I discussed earlier, because you have only tasks. The tasks, they don't have side effects. You keep the lineage of tasks; you record that. If something fails, you re-execute the tasks which created the data you lost because of the failure. That was Spark. So now, because the data is in memory, machine learning applications are going to run much faster. Because between iterations, the data is still in memory. And by the way, it was also more flexible as a computation model, because Hadoop has only MapReduce — two stages. But here you can chain a lot more stages. And obviously if the data is in memory, the queries are going to return much faster, even if you have to scan the entire data set which is in memory. These are kind of the use cases which powered Spark.
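A minimal PySpark sketch of the pattern Ion is describing, with made-up numbers and a toy gradient step rather than anything from the interview: the working set is cached in memory once, then reused across iterations, and if a partition is lost Spark recomputes it from the recorded lineage (parallelize, then map) instead of restoring replicated on-disk copies.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext

# Cache the working set in memory; the lineage (parallelize -> map) is what
# gets recorded for fault tolerance, not extra on-disk replicas.
data = sc.parallelize(range(1_000_000)).map(lambda x: x * 1e-6).cache()

w = 0.0
for _ in range(10):
    # Iterative job: each pass reuses the cached RDD instead of rereading disk.
    grad = data.map(lambda x: (w * x - x) * x).mean()
    w -= 1.0 * grad

print(w)   # approaches 1.0 for this toy objective
spark.stop()
```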
And now you are saying, you are asking, ""Okay, how is it displaced?"" You see, Hadoop — in some sense — it was a lot of hype. For good reasons, but it was still in a bubble. It was quite amazing, because everyone — at least in the tech world — knew about Hadoop and big data. But in terms of the number of companies, like in 2012, 2013, that period, there were not a lot of companies using Hadoop. The summit, the Hadoop summit, was like 300, 500 people. Maybe 700 people. It was like a bubble. And then Spark came into that bubble and it said, ""We are going to provide a better computation engine. And we are going to work."" Because Hadoop has two parts: a computation engine, which is MapReduce, and HDFS, which is this file system. Initially, it was a fight...not a fight, but Spark was viewed for a long time as only being able to operate on small data which fits in memory. But when we started, it wasn't at all difficult to operate on data on disk, and Spark was actually doing that from day one. But the focus was on in-memory, because that was what it was doing particularly well. Then it was a very smooth replacement, because it was now another engine in the same ecosystem, and then Cloudera bet on it at the end of 2013. And then it snowballed from there. Lukas: Was it obvious that there was an opportunity to start a company around Spark? Ion: Initially, we built Spark and it was an academic project. People started to use it, and the obvious question was, ""Well, as a company, am I going to bet on this? I like Spark, but can I bet on it?"" What happens when Matei or whoever graduates? What happens to the project? We really wanted to have an impact, because we saw this as a much better way to do data processing. We saw that data is a big problem. There were two ways to go about it. You need eventually to have a company behind the open source, to make open source a viable solution, at least for large organizations. I'm not going to give names, but we went to a Hadoop company — we were friends with Cloudera, Hortonworks, and so forth, even MapR. We knew people, they were actually sponsors of our lab at Berkeley, we were meeting all the time — and we actually asked, don't you want to take over Spark? But they didn't, because there were other plans about what would come after Hadoop and MapReduce as a computation engine. And then it just happened, times aligned. I was about to take a leave, Matei was graduating, and all the other people...Andy and Patrick were already thinking about creating a company. So it all came together and we said, ""Okay, let's start a company."" We had a lot of discussion when we started the company, and one of the big questions was whether the company's success is predicated on Spark's success, the open source success. Remember, when we started, things were not very clear. We started in 2013. We started to talk about the company in the fall of 2012. When we looked around, you had Linux, which is a pretty special phenomenon. But if you look at that time, there was no unicorn based on open source. There was MySQL, but it was only later sold to Oracle. Lukas: Cloudera wasn't big yet? Ion: Cloudera was not big enough. Hortonworks was small. It wasn't big enough. It's only one or two years after that we started to see these big rounds, at valuations of four-point-something billion. Also, Cloudera was...people think actually they're Cloud-era because they initially wanted to do it in the cloud, but they saw that there's not enough business in the cloud, and probably it was true then. And then they pivoted into on-prem.
We started the company. Long story, but we decided to go — at that time — with this new business model. We would only provide the hosted version of Spark on the cloud, initially only on AWS. We decided that the success of the open source was necessary for the success of the company. We were saying, ""Okay, if the open source is going to be successful, then if we build the best product for the open source, hopefully we are going to get these customers."" Even if initially, maybe, there would be other open source companies providing Spark, or the clouds themselves. Because then Cloudera provided Spark to their users, then MapR, then Hortonworks, and obviously AWS, and Microsoft's Azure with HDInsight. We committed, we bet on the success of the open source back then. And we put a lot of effort into that. Lukas: It seems like now, building a business on an open source model is an incredibly popular strategy for infrastructure companies. Ion: Yeah. Lukas: Do you ever... Ion: Databricks was one of the first to do that. Before then, it was on-prem, and that was a business model. The business model of on-prem was a little bit heavier, much heavier. And remember that some companies founded at the same time, they failed. Even if the open source would be huge...well, not failed, but they're not as successful as people believed. It wasn't clear at all. I mean, that was, at that time, a pretty big bet. We got very hard pushback and a lot of pressure to go on-prem, at least initially. But now, building a hosted offering for an open source project is quite common. Lukas: Why do you think the popular deep learning frameworks, like TensorFlow and PyTorch, don't have something like that hosted in the cloud? Even though enterprises generally use them, that business model doesn't exist there. Ion: It's a great question. I can just...obviously this is a hypothesis. For PyTorch, you have Grid AI right now- Lukas: That's true. Ion: -providing some of the hosting. I think that these are coming from large companies — open source from large companies — and these companies themselves are not interested in monetizing it directly. The way, for instance, Google probably thinks about TensorFlow, the monetization, is that TensorFlow and everything would work best on GCP, in particular using TPUs, and that's how they're going to monetize. The best place to train models which use TensorFlow is going to be GCP. The same with Kubernetes. It's hard for a company which doesn't have the creators of an open source project to create a business around it...it's harder. If you don't have the creators of that particular open source project at the company, then it's just harder. You cannot orchestrate, you cannot develop the open source and the offering in sync. I'm not aware of a huge success so far of a company behind Kubernetes. But how could you do that? Most of the Kubernetes developers are still with Google. So I think it has to have something to do with that. And the other thing is about...hosted offerings are more valuable when the solution is distributed, because then the value is to manage a cluster. As long as you run on only a single machine, the value is a little bit less. Now, of course, TensorFlow can run distributed on a bunch of machines and so forth; there's TensorFlow distributed. But I do think that these are the two things. One, most of the uses of PyTorch and TensorFlow are still on a single machine. And the second one, most of the developers of these open source libraries are still with these large companies like Google, Facebook.
I may be wrong, but that's what I think are the differences, at least these are some differences. Lukas: Interesting. Lukas: I guess another question I wanted ask you — as someone who started these two very, very successful companies — do you think that the humans you picked as co-founders had anything to do with that? Was there something that you saw in them, or some commonalities between the co-founders that you picked that you think made them effective? Ion: Oh, absolutely. Absolutely. And you know that, Lukas. The people are so important. I mean, I'm telling everyone that the things I'm the most proud of at Databricks — and I'm saying Databricks because it's an older company, so you can see...it's more time to observe — I am proud of the original team. At some point, to be successful, you need everything. Including being lucky, right? But I think that the people were quite complementary. They have all — despite the fact that they all have a lot of accomplishments — relatively low ego. We were very open and we are a team. Like Matei, I know him since 2006, 2007 when he joined Berkeley. Ali came to Berkeley in 2009. Andy was also there at that time, then Patrick. So we knew each other for a long time. We were together. We were very open in discussing any issues. We were not always agreeing, we had shouting matches and so forth. I remember that later, people told us that this small office in Berkeley — and we didn't realize — but when we are having these very passionate exchanges, people are hearing almost everything because there is not good isolation. It was at some level scary, because you have these people who are supposed to lead the company and they don't agree, on even probably basic things. But we were very comfortable debating. I think that it's the same with Robert and Phillip at Anyscale. It's again low ego and so forth. I think that one thing you want from everyone, including the CEO and so forth, you want everyone to put the success of the company above what everyone's goals are. Because these things, which is absolutely true. What is the saying, ""There is no winner in a losing team,"" right? Lukas: Right. Ion: I think this is what I would say. You need...when you know people for a long time, you have that trust. Trust is absolutely fundamental. Because there are good things, and there are high and low points in the life of every company. I imagine a small company is like you have a plane which is flying very close to the ground. There is not a lot of room you have there. I'm not saying that everyone is absolutely humble or whatever, but absolutely they need to believe that the most important thing is success of the company. Lukas: When you set up Ray as a business, you had been running Databricks for a while and it was starting to see real success. I imagine you were quite a different person. Did you think about starting that company differently than starting Databricks? Ion: What strikes me is how much great feedback you get from people, and how much of this feedback you ignore. If I think back, it's about fundamentals. Everyone knows what you need, at least in theory, to build a great company. Of course, you need to have a great team. You need to have a vision, strategy. You need to really focus on the product market fit. Early customers, make them super successful, iterate from there. Everyone is like...you know how to do it. But what strikes me is how hard it is to do it. 
I don't think people do the wrong things because, in general, they don't know what the right thing is. They do the wrong thing because doing the right thing is very hard. Imagine that you go to San Francisco, or pick your favorite city, and you ask passers-by, ""What does it take to be successful?"" What will people say? You need to work hard, focus, have a little bit of luck. Things like that. Everyone will tell you what you need...be driven, persistent, whatever they're going to tell you. You are going to get a lot of similar answers. All of them actually know what it takes, but how many people do it? And the reason is, it's just hard to do it. It's damn hard. When I'm looking back, there are some things we stuck with at Databricks. Like, we picked the cloud. Why did we pick the cloud? Because of focus: we wanted to focus on something. We realized early on that developing for cloud and for on-prem is a pretty different engineering process. You need to come up with two teams. We were not even sure that we could build a great engineering team doing one thing, let alone two. So, things like that. We were thinking, ""Okay, we are fine to do the cloud because we believe the cloud market is going to be big enough for us."" If you tell me that the on-prem market is whatever, tens of billions or whatever — I don't remember at that time — what can I do about it? I have 40 people, or 80 people. In order to capture any sliver of that market, it'll take years. So why focus on that now? These are the things. What I'm trying to say is that we didn't do anything other than, in some sense, the basics. And the same thing with Anyscale. You try to focus on where you want to innovate, and for the rest, you just try to use the state-of-the-art solution. So, now how was it different? It just, in some sense, makes you more confident that these basic things are working. It also makes you more sure that there are very rarely shortcuts. Just hard work. And then it makes you appreciate even more — I didn't say it so far, but it's probably the most important thing — how important execution is. Like John Doerr was saying, ""Ideas are easy. Execution is everything."" And you get some people who make such a huge difference. Like Ron Gabrisko, who eventually became our CRO. He joined us when the company was at a few million in ARR, and took us to many hundreds of millions now. It's about having...or like with hiring. Everyone is telling you back-channel references are so important. But it's hard because it requires effort. Everything requires effort. Unfortunately, I cannot tell you there is any silver bullet or anything like that. Just stick with the basics, and remember there are no shortcuts. You also need to think that every company is different. It has to be different. If you think that they're the same, something probably is wrong. Because things change. Like, for instance, Anyscale versus Databricks. When we started Databricks, AWS was the biggest cloud. Right now you have multiple clouds, and you cannot ignore that: GCP, Azure. When we started Databricks, it was data scientists, the main people we focused on. Then data engineers and so forth. Here it's more developers and machine learning developers. And different users obviously want different things. Again, it's nothing earth-shattering. It's something obvious. But I think there are these little things. And then again, it's execution, it's speed. Lukas: Was it hard?
It's funny, I'll just say, I get asked that question a lot, about second time company, what do you do differently? I answered almost identically to you, where it's like, you know what you're supposed to do, but in the details you kind of do little things better. But I'm curious, because one thing that's different about your experience than mine is, the second time, you're founding a company and you're not CEO that time. Was it hard to work with Robert in some ways? I mean, he seems very impressive and smart, but I think it might be his first corporate job in his life. Ion: Yeah. Lukas: You must feel like, ""I know how to do this and you're not doing it,"" or did that not happen? Ion: No. I think the reason I was CEO at Databricks early on was because no one was sure that he's going to do it long term. Actually, Ali wanted to go to academia. Obviously Matei had a job at MIT, he was on leave. And so forth. With Robert and Philip, they didn't look at anything else. They didn't interview anything. This is what they wanted to do. I think that this arrangement, right now I really like it. And again, we worked together since 2015. Four years before we started the company, so we know each other very well. I think that in terms of responsibilities, Robert, myself, and to some degree Philip, we divided pretty well. As you know, as CEO there are so many things to do. Having someone you can rely on and split some of the responsibilities, help solve... Lukas: That makes sense. Lukas: Well, we're going slightly over time and we always like to end with two questions, which I'll do with you. But maybe I'll make the second to last one more specific. Is there another project like Spark or like Ray that you're dreaming of, that you would do if you had more time? Ion: I think the one thing I'm looking now — and this could be another next lab at Berkeley — we're looking about...tentatively we call sky computing. It's multi-cloud, but think about internet for the clouds. Fundamentally the belief here is that...what internet did is it stitched together a bunch of disparate networks and provides the abstraction of a single network. Therefore, when you send your package, you don't know through which networks your package will travel. I think that what we are going to see more and more, will be an emerging of this layer – we call the intercloud layer — which will abstract away clouds. You see the early signs. It will lead also to specialized clouds. You can think about for instance...you have a machine learning pipeline, you have data processing, you have training, you have serving. Each of these components you can actually run on different cloud for good reasons. Like for instance, maybe you process confidential data and you want to remove the P-II information from the data. You can decide to do that on Azure, because it has Azure confidential computing. You can decide to do training on TPUs, and you can decide to do serving using Amazon Inferentia, new chips. I think you're going to see also the rise of more specialized clouds, especially for machine learning. There is an announcement from NVIDIA that is Equinox, which is...it's really like GPU-optimized data centers, tightly built. So I think that's kind of something very exciting. You can see that...you look always at the trend and there are these kinds of evolutions in which the clouds, by necessity or driven by open source, provide more kinds of similar services. This provides a very good ground for emerging the next level to abstract away. Lukas: Wow. 
What an intriguing answer. It seems a little crazy, but I'm kind of convinced as you talk about it. Ion: Well, I mean, you asked for it. I think there are many projects, but I think this is one...I think it will happen. And by the way, with every company, with everything probably you want, you need to take a bet, right? You need to make a bet. If you don't make a bet, you are doing what everyone else is doing. You guys made a bet. What you are doing is absolutely not obvious, when you started. Lukas: Yeah. I agree. Ion: This can be a great company. Lukas: Totally. Ion: You have to. And if you are wrong, at least you tried. Lukas: All right. Well, usually we end with the question for ML researchers. We ask, ""What is the hardest part about getting a machine learning model into production?"" But I think for you, you're a company builder — also an academic — but I think as a founder, what is the hardest part that people wouldn't see from the outside about building a really successful big business? Ion: I think that, probably the hardest thing, one of the hardest things, is that obviously with each company there are ups and down. I think that when things are down, you may need to make corrections. It can be down because a product doesn't deliver, maybe because you are on the wrong path with the product, maybe because the wrong people...or I mean, not the best fit. I think that when things go well, it's easy. It's great. But I think there, it's about always going back to the fundamentals. Trying to not be emotional and trying to always look at the facts, whether there are trends in industry — you look at the data —, whether there are data coming from the customers, whether there are facts with respect to someone who maybe is not the best fit. When things are hard or good, it's like...we are humans, we are emotional. I always found that it's hard to split and push the emotions to take second seat, and try to think only about facts when you make the decisions. The harder things are, the more emotional you are. Because you take it personally and things like that. So I think this is what I found to be the hardest thing. And, I'm also an emotional person. At some degree, I'm getting also really excited. But that's kind of what I found. I found that, in general, when you kind of try to make decisions based on emotions — at least in my case, I think for some people it works. It's gut feeling and so forth — for me, it did not work. Lukas: Do you have any tricks for managing your emotions and thinking clearly under stress? I'm asking for a friend by the way. Ion: Yeah. I'm trying to simplify the problem. There are many things coming when you are under stress, and I try to say, ""Okay, what is the most important thing?"" and try to forget about everything else and then try to simplify the problem. Then it's easier to make a decision based on what is the important thing. I think that's what I discovered, especially when it's very hard to make a decision because there are multiple dimensions associated with the decision. Like for instance, I mentioned to you earlier on, we had a lot of discussion when we started Databricks. Okay, it's important for the data open source to be successful because now we have a company, now we need to build a product to have some revenue at some point. Obviously, there are four possibilities. The open source is successful, the company's not successful. Open source successful, company successful. Good, good, good. You have all this for a 2x2. 
When we looked at it and thought about it, we couldn't see a path for the company to be successful — a credible one — without the open source being successful. And then once we reached that conclusion, we just...there was no other discussion, we just focused on that. Yeah, we just tried to find methods to simplify it and hope that everything else we didn't consider will follow. As long as you are focusing on the main thing, everything else will follow. That's my...I'm trying to oversimplify it, and sometimes maybe that's not good. Try to think about what are the most important things I need to solve, what is the most important dimension. Lukas: Well, that's good advice and a good spot to end, I think. Thank you very much. That was a fun interview. Ion: Thank you. Thank you. Bye-bye. Lukas: Appreciate it. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we worked really hard to produce. So check it out.",7786 +Stephan Fabel — Efficient Supercomputing with NVIDIA's Base Command Platform,https://www.youtube.com/watch?v=SWVoticj4jE,3121,2022-01-06,"Stephan: Scheduling on a supercomputer typically is by Post-it. It's, ""Joe, it's your cluster this week but I need it next week."" It doesn't work that way at scale anymore. You want to interact with something that is actually understanding the use of the cluster, optimizing its use so that the overall output across all of the users is guaranteed at any given point in time. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation I had with Stephan Fabel, who is a Senior Director of Product Management at NVIDIA, where he works on the Base Command platform software that runs on top of NVIDIA's DGX machines, which are basically the most powerful computers that you can buy to train your machine learning models on. It's fun to talk about the challenges that customers face when they have access to basically unlimited compute power. This is a super fun conversation, and I hope you enjoy it. Lukas: My first question, for those who haven't heard of NVIDIA Base Command: since you are the senior product manager on it, can you tell me what Base Command aspires to do? Stephan: In a way, think of NVIDIA Base Command as your one-stop shop for all of your AI development. It's a SaaS offering from NVIDIA where you log on directly, or you log on via an integration partner, and you leverage the capabilities of Base Command to schedule jobs across a variety of infrastructures. You do that in a secure manner. You gain access to your data and retain access to your data and data sovereignty, across the infrastructure that you're scheduling the jobs on. Then it's really just a matter of optimizing that job run on NVIDIA infrastructure. That's really what Base Command aims to do. Lukas: These jobs, they're model training jobs exclusively, or is it broader than that? Stephan: Model training jobs are generally the ones that we focus on, but we also do model validation, for example. You could have single-shot inference runs as well. Lukas: Are there other pain points of model development that Base Command aspires to solve or tries to solve? Stephan: Yeah. I think that a lot of the issues that you have with AI infrastructure...that's really where it starts.
The question is, ""Where do you train your models?"" and ""How do you go about it?"" Most people start in the cloud to train their models. That's reasonable because just any development effort would start in the cloud today. At some point you reach a certain amount of scale where you say, ""Well, it may not deliver the performance I need, or it may not deliver the scale I need, at the economics I'm comfortable with,"" et cetera. For those high-end runs, you typically look at infrastructure alternatives. Then the question becomes, ""Okay, I already am used to this whole SaaS interaction model with my AI development. How do I maintain that developer motion going forward?"", where I don't have to teach them something new just because the infrastructure is different. What we have at NVIDIA is this DGX SuperPOD. The idea is to say, ""Well, how about we try this and develop Base Command as a way to access a SuperPOD, just as a cloud API would behave?"" Lukas: A DGX SuperPOD, is that something that I could put in my own infrastructure or is that something that I could access in the cloud or both? How does that work? Stephan: Typically, our customers for SuperPODs...maybe we should take a step back and understand what it is. The easiest way to think about — or the most straightforward way to think about — a DGX SuperPOD is to think of it as a super computer in a box. It's a packaged-up infrastructure solution from NVIDIA that you can purchase, and it'll be deployed on premise for you and your own data center or in a colo facility. Actually we found that a colo facility is the most likely place for you to put that because it is a pretty intensive investment. Number one, not just in terms of just the number of DGXs that are involved, for example, but also of course, in the terms of the power draw and cooling and just the requirements that you need to bring to even run this beast, essentially. That's really what then dictates where this thing usually is. What we did is we put it in a colo facility and made it available right now in directed availability fashion. We have a couple of golden tickets for some customers who want to be on this thing, and then they get to select the size of the slice they want and access that through Base Command. Lukas: I see. When you use Base Command, you're using DGX, but it's in NVIDIA's cloud and you get kind of a slice of it. Is that right? Stephan: Yeah, that's right. I know we call it NVIDIA GPU Cloud, but really think of the whole Base Command proposition today as a SaaS portal that you access, that is currently coupled to more like a rental program. It's less cloud bursty elastic; think of it more like, ""Okay, I have three DGX A100s today, and then maybe in the next couple of months, I know I need three more. I'll call NVIDIA and say, 'Hey, I need three more for the next month.'"" That's kind of how that works. Lukas: Maybe let's start with the DGX box. What would a standard box look like? What's its power draw? How big is it? How much does it cost? Can you answer these questions? Just in order of magnitude. Stephan: You're looking at about $300,000 for a single DGX A100. It'll have 8 GPUs and 640 gigabytes of memory that come along with that. Those are the A100 GPUs, the latest and greatest that we have. You're going to look at about 13 kilowatts per rack of standard deployment. Lukas: 13 kilowatts? Stephan: Yeah. Lukas: Constant or just training? Stephan: No, no. When you fire these things up, these puppies, they heat up quite a lot. 
They're pretty powerful and the DGX SuperPOD consists of at minimum 20 of those. If you think about that, that's what we call one scale unit. And we have customers that build 140 of those. Lukas: Wow. What kinds of things do they do with that? Stephan: Well, just all the largest multi-node jobs that you could possibly imagine, starting from climate change analysis. Large, huge data sets that need to be worked on there. NLP is a big draw for some of these customers. Natural language processing and the analytics that comes with those models is pretty intensive, data intensive and transfer intensive. We keep talking about the DGXs and of course we're very proud of them and all of that, but we also acquired a company called Mellanox a year ago. So of course the networking plays a huge role in the infrastructure layout of such a SuperPOD. If you have multi-rail InfiniBand connections between all of those boxes and the storage, which typically uses a parallel file system in a SuperPOD, then what you'll get is essentially a extreme performance even for multi-node jobs. Any job that even has to go above and beyond multiple GPUs, a DGX SuperPOD architecture will get you there. Essentially at the, I would say, probably one of the best speed performance characteristics that you could possibly have. The SuperPOD scored number 5 on the top 500. It's nothing to sneeze at. Lukas: How does the experience of training on that compare to something that listeners would be more familiar with, like a 2080 or 3080, which feels pretty fast already. How much faster is this and do you need to use a special version of TensorFlow or PyTorch or something like this to even take advantage of the parallelism? Stephan: I'd have to check exactly how to quantify an A30 to an A100, but think of it as this. Any other GPU that you might want to use for training in a traditional server, think of it as a subset of the capabilities of an A100. If you use, for example, our MIG capability, you can really slice that GPU down to a T4-type performance profile and say, ""Well, I'm testing stuff out on a really small performance profile without having to occupy the entire GPU."" Once you have the same approach from a software perspective...if you do your sweeps, then you do essentially the same thing. Or you could do those on MIG instances and then thereby you don't need that many DGXs when you do it. I guess I should say that that's the beauty of CUDA. If you write this once it'll run on an A30, it'll run on an A100, it'll run on a T4. In fact, we provide a whole lot of base images that are free for people to use and to start with, and then sort of lift the tide for everybody. These are pre-optimized container images that people can build on. Lukas: I would think there'd be a lot of networking issues and parallelization issues that would come up, maybe uniquely, at this scale. Is that something that NVIDIA tries to help with? Does CUDA actually help with that? I think of CUDA as compiling something to run on a single GPU. Stephan: Absolutely. If you think of CUDA as a very horizontal platform piece in the software stack of your AI training stack, then components like NCCL, for example, provide you with pretty optimized communication paths for multi-GPU jobs, but they'll also span multi-nodes. This starts from selecting the right NIC to exit a signal, because that means you're going to the right port and the top of the rack switch. 
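To make the NCCL point concrete, here is a minimal sketch using PyTorch's NCCL backend; it assumes a torchrun launch on a multi-GPU machine and is purely illustrative, not anything specific to Base Command. The calling code is the same whether NCCL routes the collective over PCIe, NVLink, or InfiniBand.

```python
# Launch with, e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL handles the GPU collectives
local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
torch.cuda.set_device(local_rank)

# Each rank contributes its own rank id; all_reduce sums across every GPU.
t = torch.full((1024,), float(dist.get_rank()), device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)

if dist.get_rank() == 0:
    print(t[0].item())                          # 0+1+...+7 = 28.0 on 8 GPUs
dist.destroy_process_group()
```

Underneath that one call, NCCL is doing the NIC and switch-path selection Stephan describes next.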
That means you minimize the latency that your signal takes from point A to point B in such a dataset center. When you look at CUDA, and especially at components like NCCL and Magnum IO as a whole — which is our portfolio of communication libraries and storage acceleration libraries — it starts from the integration of the hardware and the understanding of the actual chip itself, and then it builds outward from there. The big shift at NVIDIA that we're looking at accelerating with use of Base Command is this understanding that NVIDIA is now thinking about the entire data center. It's not just about, ""I got the newest GPU, and now my game runs faster."" Certainly that's a focus area of us as well. But if you take the entire stack and work inside out, essentially, then the value proposition just multiplies the further out you go. With Base Command, this is sort of the last step in this whole journey to turn it into a hybrid proposition. I know it's very high-level right now and abstract, but it's a super interesting problem to solve. If you think about how data center infrastructure evolved over the last, let's say 10 years or so, then it was about introducing more homogeneity into the actual layout of the data center. Certain type of server, certain type of CPU, certain type of top-of-rack switch, and then a certain layout. You have all these non-blocking fabric reference architectures that are out there and et cetera, et cetera. Ultimately now that everything is homogeneous, you can now make it addressable using an API because everything is at least intended to behave in this very standard and predictable way. We worked our way up there. This has never been the case for something like a supercomputer. A supercomputer was a 2-year research project with a lot of finagling and ""Parameters here, and then set this thing to a magic value and that thing to a magic value, and then run it on 5 minutes after midnight, but not on Tuesdays,"" and then you get the performance. This whole contribution that we're really making here is that we're raising that bar to a predictable performance profile that is repeatable. Not just inside an NVIDIA data center, where we know 5 minutes after midnight and so on, but also in your data center or in an actual random data center, provided you can afford the cooling and power of course. But then once we got that out of the way, we're pretty good. That's a real shift forward towards enabling enterprises, real bonafide true blue chip companies, to actually adopt AI at a larger scale. Lukas: It's interesting. One thing I was thinking of as you were talking is, most of the customers that we work with...we don't always know, but I think what we typically see with our customers that are training a lot of machine learning models, is they use a lot of NVIDIA hardware, but it's less powerful hardware than the DGX. It might be P100 or basically whatever's available to them through Amazon or Azure or Google Cloud. I think they do that for convenience, I think people come out of school knowing how to train on those types of infrastructure. Then their compute costs do get high enough. I mean, we do see compute costs certainly well into the seven, eight figures. Do you think that they're making a mistake by doing it that way? Should they be buying custom DGX hardware and putting that into colo, would they actually save money or make their teams more productive if they did it that way? Stephan: Oh God, no. Just to be really clear, Base Command is not a cloud. 
We're not intending to go out there and say, ""Go here instead of Amazon,"" or something like that, that's not what we are saying. First of all, you can get A100 instances in all the major public clouds as well. You could have access to those instances in just the same way that you're used to consuming the P100s or V100s or anything like that. Whether it's Pascal or Volta or Ampere architecture, all of it is available in the public cloud. Like I said in the beginning, it's just a perfectly acceptable way to start. In fact, it's the recommended path, to start in the cloud, because it requires the least upfront investment. I mean, zero. And you get to see how far you can push something, an idea. Once you arrive at a certain point, I think then it's a question of economics, and then just everything will start falling into place. What we found is that enterprises typically arrive at a base load of GPUs. In other words, at any given moment in time, for whatever reason, there is a certain number of GPUs working. Once you identify that, ""Hey, every day I keep at least 500 GPUs busy,"" then typically the economics are better if you purchase. Typically, a CapEx approach works out better. It's not always the case, but typically that might be the case. To meet that need in the market is where we come in. What Base Command right now offers is this...it's not the all the way ""Purchase it"", you don't have to have that big CapEx investment up front, but it is something in between. You do get to rent something, it's not entirely cloud, but you're moving from the Uber model to the National Car Rental-type model. Once you're done renting, then you maybe want to buy a car. But the point is that there's room here on that spectrum. Currently we're right smack in the middle of that one. That's typically what we say to customers. Just actually yesterday, somebody said, ""Well, how do you support bursting? And how elastic are you?"" I said, ""That's not the point here."" You want to be in cloud when you want to be elastic and bursty, but typically that base load is done better in different ways. Lukas: What breaks if I don't use Base Command? If I just purchased one of these machines and I'm just shell-ing into the machine and kicking off my jobs the way I'm typically used to or running something in a notebook. What starts to break where you know that you need something more sophisticated? Stephan: On the face of it, nothing really breaks. It just takes a lot of expertise to put these things together. If you buy a single box, then there's probably very little value add in adding that to a SaaS platform, per se. But as soon as you start thinking about a cluster of machines — and like I said, more and more of our enterprise customers are actually thinking about deploying many of those, not just a single machine — then as soon as that comes into play, then you're faced with all the traditional skill challenges in your enterprise that you'd be used to from just rolling out private cloud infrastructure. It's the same exact journey. It's the same exact challenge. You need to have somebody who understands these machines and somebody who understands networking, somebody who understands storage, Kubernetes and so on and forth. As soon as you build up the skill profile that you need to actually run this infrastructure at scale and at capacity, then you're good to go, right? You can build your own solution, but typically what you'd be lacking are things that then help you make the most of it. 
All the kinds of efficiency gains that you'd have by just having visibility into the use of the GPU. All the telemetry and the aggregates by job and by user and by team. This entire concept of chargeback, et cetera, is a whole other hurdle that you then have to climb. What we're looking at is people who want to build a cluster, typically they want to do that because they want to share that cluster. It's a pretty big beast. If you build a big cluster, might as well, because you want to be more efficient and you want to make the most of it, and so now you need to have a broker who brokers access to the system supercomputers. As ridiculous as it sounds, scheduling on a supercomputer typically is by Post-it. It's, ""Joe, it's your cluster this week but I need it next week."" It doesn't work that way at scale anymore. You want to interact with something that is actually understanding the use of the cluster, optimizing its use so that the overall output across all of the users is guaranteed at any given point in time. Lukas: I have the sense that many years ago, decades ago, when I was a kid or maybe even before that, supercomputers felt like this really important resource that we use for lots of applications. Then maybe in the nineties or the aughts, they became less popular and people started moving their compute jobs to sort of distributed commodity hardware. And maybe they're kind of making a comeback again. Do you think that's an accurate impression? Do you have a sense of what the forces are that makes supercomputers more or less interesting, compared to just making a huge stack of chips that you could buy in the store? Stephan: Yeah. It is interesting because if you think about it, we've actually oscillated back and forth between this concept a little bit for years. I mean, you're exactly right. The first wave of standardization was, ""Let's just use 19-inch rack units and start from there and then see, maybe that's a little bit better."" Then sort of the same thing happened when we decided to use containers as an artifact to deliver software from point A to point B. Standardization and form factor really is what drove us there. Certainly there's value in that. The interesting moment happens when all of that together becomes...when the complexity of running all of that together and lining it all just up, right? In the beginning you had one IBM S390, and you'd know that's the one thing you have to line up. Now you have 200 OEM servers across X racks, and that's a lot of ducks to line up. The complexity and management of independent systems that you're sort of adding together, that sounds good on paper, but at some point you're crossing that complexity line where it's just more complex to even manage the hardware. This is not just from an effort perspective, this is also from a CPU load perspective. If more than 50% of your cores go towards just staying in sync with everybody else, how much are you really getting out of each individual component that makes up this cluster? Now of course you're saying, ""Well, how do I disrupt it?"" Well, you disrupt it by making assumptions about how this infrastructure actually looks like, rather than saying, ""Well, you're a drop in the ocean, you first have to figure out where you're even at."" If you eliminate that complexity, then fundamentally you can go straight into focusing more on a data plane-type focus rather than figuring out how the control plane looks like and how busy that one is. It's got a little bit of that. 
I think the DGX represents an optimization that shows...rather than purchasing 8 separate servers that have potentially similar GPUs in them, here's a way that not only has those 8 GPUs in them, but it also is interconnected in a way that just makes optimal assumptions about what's going on between those 2 GPUs and what could possibly run on them. That combined with a software stack that's optimized for this layout just brings the value home. That's really where we're coming from. Lukas: It's interesting. When I started doing machine learning, the hardware was pretty abstracted away. We would compete for computing resources, so I got a little bit handy with Unix and NICE-ing processes and just coordinating with other people in grad school. But I really had no sense of the underlying hardware. I don't even think I took any classes on networking or chip architecture, and now I really regret it. I feel like I'm actually learning more and more about it and the hardware is becoming less and less abstracted away every year. I think NVIDIA has a real role to play there. Do you think that over time, we'll go back to a more abstracted away hardware model and we'll figure out the right APIs to this? Or do you think that we're going to make more and more specialized hardware for the different things that people are likely going to want to do, and a core skill of an ML practitioner is going to need to be ""understanding how the underlying hardware works""? Stephan: Yeah. I think what you said there is...I'm reminded of 10 years ago, we used to say, ""Well, if you're a web frontend developer and you don't know TCP/IP, you're not really a web frontend developer,"" but most web frontend developers will never think about TCP/IP. I think this is very true here, too. You have an MLOps practitioner and today you get to think about your models and tensors, hyperparameter searches, and all of that kind of stuff, and yes, that's important. Well, not important, it's crucial. Without that you couldn't do your work. But, increasingly you also have to know where you're actually running, in order to get the performance that you need. Today it's a real competitive advantage for the companies out there to increase the training speed. Obviously what we're solving is just getting started. I mean, we take all that pain away, you just log onto Base Command, off you go. But increasingly it's a true competitive advantage. Not to be in the cloud, but to be training faster than anybody else. 2012, 2013, if you weren't working on a cloud initiative as a CIO, that was a problem. Now, increasingly, If you're not focusing on how to accelerate AI training, now you're putting your company at a disadvantage. That means that the necessity for each individual practitioner who interacts with the hardware to actually understand what they run on and how to optimize for this is going to increase. Having said that though, part of our job at NVIDIA, I think, is to make optimal choices on behalf of the practitioner out of the gate. Rather than requiring people to really understand, let's say, the clock rates of each individual bus or something like that, we'll abstract it away. People will argue that CUDA is already still pretty low level, but we're actually abstracting a whole lot to even get to that point. I would say while that's true, we're trying to shield the practitioner as much as possible. 
We have a leg up because we can work with both the knowledge of how the GPU looks like, and most importantly how the next GPU will look like, but also how to expose that optimally at the application layer and interact with the MLOps providers in a meaningful way that just is optimal throughout. Lukas: Have there been any kind of cultural changes needed to build a SaaS, customer-facing product like Base Command at a company that comes up through making really great semiconductors and very... I would call CUDA low-level from my vantage point. Obviously it's an amazing piece of software, but it's a very low-level software. Has NVIDIA needed to make adjustments in the product development process to make Base Command work for customers? Stephan: Yeah, it's interesting. Base Command is actually not a new product. We've been using this thing internally for over five years. It was a natural situation for us because...five years ago we launched the first DGX. Of course, if you launch something like the DGX, and you say that's the best thing you could possibly purchase for the purposes of training, and you have 2,600 AI researchers in house, then you can imagine the obvious next question is, ""Okay, how do we use this thing to accelerate our own AI research?"" This need for creating large-scale AI infrastructure on the basis of the DGXs was born right out of this situation. With that came all these issues and as we solved them, we just kept adding to this portal or to this..it's more than just a portal. I mean, it's the entire stack, it's the infrastructure provisioning, and then the exposure, the middleware, the scheduler, the entire thing. It became more and more obvious to us what should be done. These 2,600 researchers that I just mentioned, bless their heart, they really had to go through a lot of iteration with us and be very patient with us until we got it to the point where they'd, let's say, not complain as much. The point is that we really tried to get it right. We acted in a very transparent manner with a pretty large community of AI researchers and developers, and they told us what they needed and what they wanted and what their pain points were. Going to market now with Base Command as an externally facing product was simply turning that to the outside. Lukas: Have there been any surprises in taking it to market? I know that sometimes when companies have an internal tool, like I think the TensorFlow team has talked about this, that it's made for a really, really advanced large team and then you want to take it to someone who's newer, or a smaller team, they kind of have new needs that are a little bit surprising to people that have been doing this for a long time. Have you encountered anything like that as you bring it to market? Stephan: Yeah. It's funny you asked. We encounter this in just many different aspects. One example is that most customers...like I said, we make this available. The internal example that we use is, ""Oh, you get to drive the Lamborghini for a while,"" the idea is this is a short term rental. I mean, how long are you renting a Lamborghini? Maybe a day or two or a weekend. Here, we're saying short-term rental, they're probably going to rent this for three months or something like that. It turns out, most customers want to rent this for two years, three years. What surprised us was that there's a real need for, not only for a long-term rental, but especially the immediacy of access to this. 
I think we had underestimated a little bit how desperate the market was to get started right away. We knew that people would want to get started, but we always figured, ""Well, the cloud is there to get started right away, you just sign up and swipe your credit card and off you go."" The need for large-scale training and just the immediacy of that need, that personally was a surprise to me. I hadn't expected that. I thought that would be much more of a slower ramp than it was. I thought I was going to be in different sales conversations than I actually found myself in. That was a surprise. Other surprises are just understanding just how much people still have to go. Typically, we encounter folks who say, ""My way to scale and accelerate my training is just to pick a larger GPU."" There's a big, big portion of the market that certainly has been operating that way. But really helping them see that sometimes it's not scale-up model but the scale-out model that might be appropriate as the next step, it wasn't exactly surprising, but it was interesting to see just how widespread that scale-up thinking was rather than the scale-out thinking. Lukas: Can you say more about scale-up versus scale-out? What do you mean? What's the difference there? Stephan: If you think about cloud infrastructure, then a scale-up approach would be, ""You started with a medium instance and you go to an X-large,"" or something like that. You just choose more powerful resources to power the same exact hardware, but you don't really think about adding a second server, for example, and now spread the load across multiple instances. Here, it would be something similar. If you always think about saying, I choose to run this on a Volta-based system and now I have a Volta-based GPU. Now my way to make this faster is to go to an Ampere-based architecture GPU,"" that would be scaling up. Certainly, that's something that you want to do, but at some point, your pace and your need for accelerated training actually exceeds the cadence at which we can provide you the next fastest GPU. If you need to scale faster than that, and if that curve exceeds the other, then you're essentially in a situation where you have to say, ""Well, how about I take a second A100?"" Then I have a multi-GPU scenario, and let's just deal with that, and so on and so forth. The natural conclusion of that is, ""How about multi-node jobs where they're smack full of the latest and greatest GPUs, and then how many nodes can I spread my job across?"" If you do, I don't know, 5 billion parameters then yeah, you're going to have to do that. Then you're going to be pretty busy trying to organize a job across multiple sets of nodes. Lukas: Do you have any sense on how your customers view the trade-off of buying more GPUs, buying more hardware to make their models perform better? Are they really doing a clear ROI calculation? One of the things that we see at Weights & Biases is that it seems like our customers' use of GPU just expands to fit whatever capacity they actually have, which I'm sure is wonderful for NVIDIA, but you wonder if the day will come where people start to scrutinize that cost more carefully. Some people have pointed out that there's possibly even environmental impact from just monstrous training runs, or even a kind of a sad effect where no one can replicate the latest academic research if it only can be done at multi-million-dollar-scale compute. How do you think about that? Stephan: In the end, I think it's a pretty simple concept. 
If the competitive advantage for companies today is derived from being able to train faster and larger and better models, you're not speaking to the CFO anymore. You're speaking to the product teams. At that point, it just becomes a completely different conversation. The only interesting piece here is that traditionally, of course, data center infrastructure is a cost center, whereas now we're talking about turning it into value center. If you turn it into a value center, then you really don't have this problem. Of course we have extensive ROI conversations with our customers. We have TCO calculators and all that good stuff, it's definitely there. It's really about helping customers choose, ""Should we do more cloud for where we're at?"" and from a GPU standpoint, we're happy with either outcome. We're maintaining neutrality in that aspect that we're saying, ""Well, if more cloud usage turns out to be better for you, then you should absolutely go and do that."" Then if we figure out that the economics shifted in such a way that a mix of cloud and on-prem, or cloud and hosted resources makes sense, then we'll propose that. It's really about finding the best solution there and definitely our customers are asking these questions and making pretty hard calculations on that. But, I mean, it's pretty obvious. If you think about it...a couple years ago, we talked to an autonomous driving lab team and they said, ""Well, Company A put 300,000 miles autonomously on the road last year, and we put 70,000 miles on the road last year autonomously. We got to change that. How do I at least match the 300,000 miles a year that I can put autonomously on the road?"" So that's a direct function of, ""How well does your model work?"" and so on and so forth. It's a pretty clear tie-in right now. Lukas: What about inference? A lot of the customers that we talk to, inference is really the dominant compute costs that they have, so the training is actually much smaller than the spend on inference. Do you offer solutions for inference too? Could I use Base Command at inference time, or is it entirely training? And do people ever use these DGX machines for inference, or would that just be a crazy waste of an incredibly expensive resource? Stephan: Yes and no, it depends on how you use it. First of all, you can use Base Command for model validation purposes. You can have single-shot runs. But some customers want to set up a server that is dedicated to inference and then just take MIG slices and say, ""Well, I'll do my model validation at scale, basically. I'll do my scoring there."" If you share that infrastructure across a large number of data scientists, you put your DGX to a good use. There's no issue with that. We do have a sister SaaS offering to Base Command called Fleet Command. That is meant to take the output of Base Command in the form of a container, of course, and then deploy that at scale and orchestrate it at scale, and really manage the inference workloads at the edge for our customers. It's an end-to-end coverage there from a SaaS perspective. Lukas: In your view, based on the problems that you're seeing in the market, what functionality are customers asking for in their software layer for machine learning training that you're interested in providing? 
Stephan: That's a really good question because it goes to the heart of the question, ""What space is Base Command seeking to occupy in a theoretical stack where the infrastructure's at the bottom and something like Weights & Biases at the top?"" I would see Base Command's role as an arbiter and a broker. Almost like a bridge between a pure developer-focused, almost like an IDE, perspective and bridging that into enterprise-ready architecture. Let me give you a simple example. If you do dataset versioning — and then let's say that's what you want to do with your MLOps platform — then there's many ways to version data. You can try and be smart about this, but at the end of the day, it's a question of what infrastructure is available to you. If I have an optimized storage filer underneath, my dataset versioning strategy looks entirely different than if I just have kind of a scale-out, open source storage backend. If I work with S3 buckets, then my versioning looks different than if I do that with NFS shares. The value that Base Command provides is that it abstracts it away. If you do dataset versioning with Base Command, then it'll do snapshots. If you do it on a NetApp filer, it'll do other things than if you do it with a different storage. But those are exactly the questions that an enterprise architect will be interested in. How do you deal with that? Just because you figure you need 50 versions of your dataset that's 3TB large, does that mean I need to plan for almost infinite storage? No, it doesn't. We can help you translate that and make that consumable in the enterprise. I think that's a big piece that I think Base Command can provide as this arbiter between the infrastructure and the API, if you will. The second thing is, increasingly, I've seen people being very concerned about data security and governance around this. If you have sufficiently large infrastructure to deal with, then almost always you have multiple GEOs to deal with. They have different laws about the data that's being allowed at any given point in time. Just the ability to say, ""This dataset can never leave France,"" or ""That dataset has to only be visible to these three people and nobody else,"" is of extreme value to enterprises. All those things come into play, and I think that's where Base Command can help. Lukas: Are there other parts of Base Command that you've put a lot of work into that people might not realize the amount of effort that it took, that might be invisible just to a customer, even just me even imagining what Base Command does? Stephan: Yeah. I think that we invested a lot in our scheduler. If you look at the layout of DGXs in a SuperPOD arrangement and the nature of the jobs that go into this, I think people underestimate just how optimized the scheduler is across, not just multiple nodes, but also within the node. For you to be able to say, ""I'm running a job with a one-GPU configuration,"" and then it's a slider, and then I say, ""Well, I'm turning this into an eight-GPU job now,"" and that's literally a selection. What goes on in the background, it's just a lot more intricate than people typically realize. But it goes on automatically and you do have to be ready for it. You have to program for it, and people know that. But as soon as you do that at your layer or the optimization underneath, it's just incredible. Lukas: What's tricky, is it like you need to find eight GPUs that are close to each other and not being used and all that, is that the basic challenge? Stephan: Yeah, exactly.
Data locality, caching strategies, all that kind of stuff is going straight into that selection. Lukas: Cool. All right. Well, we always end with two questions, both on ML. Let's see how you answer them. One thing we always ask is, what's an underrated aspect of machine learning that you think people should pay more attention to, or you would love to spend time on if you had more free time? Stephan: I think what's underrated is this aspect of crowdsourcing. I don't think anybody is looking at machine learning and the potential that just many small devices that are contributory to the creation of a model would bring. I think that we're at the cusp of that, but we're not really doing that right now. I think to the degree that it already happens, it's very hidden from us. We all know Google will run some algorithms across data that was collected through the phones. We understand that on a conceptual level, but just the ability to bring that together in a more natural sense that we might want to find recommendations not on the basis of a single parameter, but find recommendations of more meaningful parameters. I find five-star reviews very meaningless, for example. I think that is a very simplified view of the world. I find, consequently, also one-star reviews very meaningless. But if you could actually have a more natural understanding based on machine learning, that would be an interesting topic to explore, because it would have to be based on just all kinds of inputs that would have to be taken into account. I would like to see that and I think that would be an interesting field of research, an interesting field of development. I think people still assume that it's only a prerogative of the big companies to be able to do that, but I think there's an open source project in there somewhere. Lukas: Cool. I hope somebody starts that, and then they should send it to us when they do. Lukas: Our final question is, when you look at your customers and their effort to take business problems and turn them into machine learning problems, then deploy them, and solve those problems, where do you see the biggest bottleneck? Where are they struggling the most right now? Stephan: The biggest issue they have — at least as far as I can tell — is that they have just a getting started issue in the sense of, ""How do I scale this beyond my initial POC?"" I think that the prefab solutions that are out there are pretty good at walking you through a getting started tutorial and then they'll probably get you really far if you're a serious practitioner and you devote some time to it, but I think that at some point, you'll hit problems that may not even have anything to do with ML. They may just have something to do with infrastructure that's available to you and things like that. I think that anybody who is trying to use this for a commercial and a business strategic purpose is going to run into an issue sooner or later of, ""How do I go from Point A to Point B here?"" People call it something like AI DevOps, or something like that that's floated around. I think, as an industry, we should be aiming to make sure that that job never comes and sees the light of day. Lukas: Too late, I think. Stephan: Yeah. I feel like we lost on that one already. But I really think we should do better. You shouldn't have to require super special skills to create this whole DevOps approach around AI training. We should really know better by now how that whole approach works and then build products that drive that. Lukas: Awesome.
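As an aside on the dataset-versioning point Stephan made above, here is an illustrative Python sketch of how one interface can hide storage-specific versioning mechanics. Every name in it, including the filer-cli command, is hypothetical rather than Base Command's or any vendor's actual API.

    # Illustrative-only sketch: training code asks for a snapshot, and the backend
    # decides whether that means a physical copy or a copy-on-write filer snapshot.
    import shutil
    import subprocess
    from abc import ABC, abstractmethod
    from pathlib import Path

    class DatasetVersioner(ABC):
        @abstractmethod
        def snapshot(self, dataset_path: str, version: str) -> str:
            '''Record an immutable version of the dataset and return its identifier.'''

    class CopyingVersioner(DatasetVersioner):
        # Naive backend: physically copy the data (expensive for multi-TB datasets).
        def snapshot(self, dataset_path: str, version: str) -> str:
            src = Path(dataset_path)
            dst = src.with_name(f'{src.name}@{version}')
            shutil.copytree(src, dst)
            return str(dst)

    class FilerSnapshotVersioner(DatasetVersioner):
        # Backend for a filer with copy-on-write snapshots: near-zero extra storage.
        def __init__(self, volume: str):
            self.volume = volume
        def snapshot(self, dataset_path: str, version: str) -> str:
            # Hypothetical CLI call standing in for a vendor snapshot command.
            subprocess.run(['filer-cli', 'snapshot', 'create', self.volume, version], check=True)
            return f'{self.volume}@{version}'

    # The calling code never changes; only the backend does:
    # versioner = FilerSnapshotVersioner('train-vol')   # or CopyingVersioner()
    # versioner.snapshot('/data/imagenet', 'v42')

The point of the sketch is the abstraction Stephan describes: fifty "versions" of a 3TB dataset need not mean fifty copies if the storage underneath can snapshot cheaply.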
Well, thanks so much for your time. I really appreciate it. That was fun. Stephan: Thank you. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.",7331 +Chris Padwick — Smart Machines for More Sustainable Farming,https://www.youtube.com/watch?v=KNrwpq1uJhA,3659,2021-12-23,"Chris: At least what I saw was that people's workflows kind of shifted when they realized that they could use this a little bit more like a scalpel instead of a sledgehammer. If you've got the ability to go in and do sort of targeted delivery to only the things that you care about, then that changes your workflow. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation with Chris Padwick, who is Director of Computer Vision and Machine Learning at Blue River, which is a company acquired by John Deere that helps farmers strategically spray pesticides and herbicides to help the environment and help their customers. It's a super cool application of machine learning, because it's so concrete and so unambiguously great for the world, but also super difficult. They run into kind of every edge case you can possibly imagine. It's really fun to talk to someone that's spent so much time working on a single hard problem. Lukas: Chris, thanks for doing this. Chris: Yeah, absolutely. Thanks for having me, Lukas. This is going to be a lot of fun. Lukas: Awesome. You work on some of my favorite ML applications. I'd love to maybe start with an explanation of what you're working on at John Deere. Chris: Yeah, absolutely. One of our products that we're working on with John Deere is called a See & Spray, and the idea is really interesting. If you look at how a farmer does their workflow right now, there's a bunch of parts to farming. If you think of it as a software stack, you can describe it as a tillage portion of that, where you're preparing the soil, then you're planting the soil, then you're weeding the soil, or you're weeding the plants and you're harvesting. There's kind of like those four sections. See & Spray targets the weeding section, where what we've done is taken an existing sprayer that folks have, and these are huge machines actually. They have a 120-foot spray boom and they're capable of upwards of 800 acres a day in weeding. These things are — I'm 6'2"" and this machine, when I stand beside the wheel, the wheel is taller than my head — so, this is a giant, giant machine. It can do tremendous amounts of productivity, is what people buy this for. The application that we're targeting is instead of spraying your entire field to kill the weeds, you really only need to spray the weeds. And that's what we built. So, we built a computer vision system with AI and specifically deep learning that does discrimination between crop and weed, and then a robotic system around that, that only targets the weeds. Lukas: Is this something that gets pulled behind a tractor? What does this machine actually look like? Chris: Yeah, this is an existing product called a self-propelled sprayer. It's really a purpose-built device. I'll do a little impression of it for you, I should have brought a little toy.
But what happens is, it has these spray booms, and I'm actually not flexible enough to do it, but imagine my elbows are pointing exactly forward. When we unfurl the boom, these booms go out and they're 60 feet on each side. We have 98 nozzles actually spaced throughout the boom. As the farmer goes through the field, what happens is the cameras — we have cameras basically spaced roughly every meter — cameras take pictures as you go through the field. Those pictures go into a machine learning algorithm that we've trained to distinguish a crop from weed. That's a convolutional neural network using deep learning. Then we map those to individual sprayer locations and then try to only spray the weed as we're going over it. Lukas: Wow, so you have like, did you say you have like 120 cameras then that unfurl or something like that? Am I doing the math right? Chris: It's a 120-foot boom. And on one of our configurations we've got 36 cameras. Lukas: Oh, one every meter I see. Chris: Yeah, that's right. Lukas: Gotcha. Chris: Yeah. Lukas: But then the sprayers are actually at a smaller interval than that. Is each one over one row of crops or are there multiple...how does that work? Chris: Yeah, that's right. So, there's different kinds of configurations. It's kind of dictated by what the farmer's doing with the machine. We do row crops right now. So, three target crops, there's soybean and cotton and corn. Depending where you are in the US, you might grow...if you're in Texas, you might grow cotton at 40-inch rows. In which case you'd sort of take a 20-inch machine where you'd have a sprayer every 20 inches. If you imagine the two rows and they're 40 inches apart, you've got a sprayer there and then a sprayer on the row. The other common one is 30-inch, which is found a little bit more in the Midwest rather than the south. Lukas: The cameras are pointed down at the crops and I guess the cameras are in front of the sprayers. So, they know when the weeds are coming. Chris: Yeah, yeah, that's right. Our geometry is that we actually have a camera that's tilted forward so that we can see the crops and then react to them. We did try to...the most ideal geometry in a side view is if you have something sticking out and then the camera pointing straight down. But unfortunately, anything that sticks out of the boom is something that will probably get torn off when it hits something. I don't know if you've ever driven something that has like a 60-foot thing sticking out of it on each side. It's really, really hard not to hit stuff. Actually, that was one of our design constraints. We really couldn't...we could put things that come up from the boom, because those are safer, but we couldn't put things that go out from the boom, because that would be a mechanical risk that we would just end up breaking parts and breaking the sprayer. Lukas: The benefit here to the farmers is that they use less, I guess, what do you even call the thing that kills the weeds? Like herbicide? Chris: Yeah. Herbicide. Yeah. Yeah. That's one of the big benefits. I'll put this problem into perspective for you and try to give you a little bit of an understanding as to why farmers are so excited by this product. Let's take an example and say that you're a 5,000-acre cotton farmer. So, you've got 5,000 acres total, maybe spread over multiple fields. The amount of money that you're going to spend just on herbicides — we actually call this the inputs and so your inputs are fuel for your sprayer.
Maybe you have to buy a new sprayer, that could be an input. Depreciation costs that could be an input. But seed costs and also herbicide are one of your really main cost drivers — just in terms of the herbicide, you could be spending up to...it wouldn't be uncommon to spend $150,000 spraying herbicide on your field. So, if I come to you with a tool like this, the See & Spray system, which is more of a targeted system and say, ""Hey, depending on your weed pressure and depending on your farming practices, depending on a few input variables, we actually might be able to save you a lot of money there."" Maybe you only have to spray 50% or maybe 70% or maybe 30%. It depends on a lot of factors, but if I could put a dent in that herbicide cost for you, then that's a really interesting proposition, because that's money that you can reinvest in your farm. You can do something else with that money rather than spray herbicide. That's one of the biggest cost drivers for the farmer. The second kind of cost driver or the second driver is wanting to be more sustainable, like participate in more sustainable farming practices. That's on a lot of farmers' minds of trying to spray less herbicide and be better stewards of the land. This product goes directly into that use case. So, folks are very excited by those two things. That's the main value propositions for the product. Lukas: That's super cool. Does it have to then do the inference on computers that live on this device? Do you actually put a whole bunch of computers to process all these images? How does that work? Chris: Yeah, that's exactly right. We've built some custom electronics with John Deere that will survive in the agricultural environment. It's sort of funny. When we think about, ""Okay, where would you put a cluster of computers?"" Probably your last choice would be in Texas in 115 degree Fahrenheit heat with lots of dust. That's kind of like not the thing that comes to mind when you're building clusters of computers, but that's actually exactly what we've done. In the current design, we have these compute units on a machine and they function in literally the worst environment you can imagine. It takes a lot of engineering to make that work. But yeah, when we first kind of described this workflow to people, they sort of think, ""Oh, so you're pushing stuff up to the cloud and doing the inference and sending the result back,"" and actually we're doing it all on board. Just like an auto driving car, because we just don't have the time, we don't have the latency to be able to do that. We need to make decisions...as soon as we see the weed, we have to react within milliseconds. So, it's all done on the platform. Lukas: Is this a hard vision problem? Just to get the accuracy to what you need, is that a big challenge for you? Chris: Yeah, it really has been a hard vision problem. There's sort of two types of products that we talk about. One of them is called a green-on-brown product and that's out in the market today. It's marketed as See & Spray Select. That's a computer vision-based AI system that is capable of spraying weeds that are in the furrow. If you think of a row crop, you've got like the two rows here and the furrow in the middle. So, this product can spray weeds that are in the furrow. The next level of that product is the green-on-green product, where now you can say, ""Okay, well, I can tell the difference between a weed and a crop, even if it's in the furrow,"" because your weed can then kind of grow anywhere. 
It's not only constrained to...anyone who has a garden, you know that weeds will grow wherever they want and they're not constrained to just grow in the furrow. So, the green-on-green product is sort of the next level of the capability. And it's a really hard vision problem. One of the things that makes it really hard is that it's tough to get labels that are correct for these. I don't know about you, but it's tough for me to tell the difference between say a Pigweed and a cotton plant that's a certain size or a Velvet Leaf or a Morning Glory. These are weeds that look lot like maybe a cotton plant or maybe a soybean plant at a certain time of its life. In order for our product to be successful, our labels on that data have to be correct. What we've kind of found through trial and error is that it's pretty easy to find people that will label images for you. But it's actually really, really tough to find people that know the difference between these kinds of weeds and these crops. That's been kind of our main challenge on this project, assembling what I would call an expert workforce of agronomists, actually, who have — some of them actually have PhD in weed science — and these folks help us develop training materials and help us tell the difference between these different varieties of crop and weed. That's really, really important as we look at our pipeline and our stack. The thing that we spend probably the most time on is talking about label quality and how to improve it and how to measure it. So, yeah, it's a huge topic for us. Lukas: Is that partially because...have weeds evolved to look like plants so that human farmers don't pull them up? Chris: There's some truth to that. I'm going to repeat a story that my agronomist told me, that there's a weed that's called a mimicry weed in rice. What happened is when people started hand weeding rice, all of a sudden, it was sort of a selective process where things that didn't look like rice got weeded out, but things that did look like rice started to survive. So this mimic weed kind of evolved and it sort of came to the point that it looked so much like rice while it was growing, that a trained person in the field had a roughly 70% chance of telling whether it was rice or not. That's how good the mimicry had gotten. And we definitely do see that in our models too. We don't officially call it the FBI's most wanted list, but maybe we should call it that. But we do have some weeds that are much more challenging to differentiate. They look a lot like...sometimes we have arguments on the CVML team, like, ""Okay, well, what do we think this is?"" And some of these cases are pretty ambiguous. Lukas: If you know where you planted the plants though, couldn't you say that any plant that's sort of not in the place where you think you planted a plant is a weed or should be sprayed? Why do you need to identify exactly what kind of plant it is? Chris: Oh yeah. That's a really good question. There's kind of two answers to that question. So, answer number one is that you certainly can...you do have that information, especially if you're using a John Deere planting stack. If you've got John Deere machines at every part of that stack, then you've got information about everything you've done. Specifically with ExactEmerge, it's a technology for planting where you can actually tell precisely where the seeds were planted. Now, the thing that you don't know is what's emerged. 
So, you do know like what was planted, but you don't really have a good sense of what's emerged. That's one reason that you do have to do kind of a vision-based approach to this. Lukas: I see. Chris: The second reason is actually a really kind of a subtle one. When we're talking about herbicides...it's funny, when I started at Blue River, I thought that, ""Oh, killing plants is easy. You just spray herbicide on them and they die."" And it's actually very, very complicated. One of the things that's most interesting here is that you've kind of got two species — rough kind of breakdowns of plants — you've got broad leaf and grasses. You actually use different herbicides to go after broad leaf versus grasses. If I can tell you with my machine learning model that, ""Okay, this is a weed and it's a broad leaf weed,"" then you're going to put something different in your tank mix to actually attack that weed. Similarly, if you have a grass weed...if you try to spray a broad leaf herbicide on your grass weed, it's not going to do anything. There's the opportunity there for more savings for the customer and more effective weed control by identifying roughly, "" Are we dealing with broad leaf or grass?"" and the See & Spray is targeting the herbicides directly to the plant that needs them. Lukas: I see. That's very cool. So, how deployed is this? If I went to fields in Texas, would I see this device in use? Chris: Yeah. If you had been in our field season...I guess our field season's still technically going. We're into what we call ""fallow"" operations right now, which is basically, ""Identify any plants that are in a fallow field and spray them."" But between the months of roughly March to say August, that was kind of our main weeding season for soy and cotton and corn. And if you had come to Texas and the Midwest and that area of the US, then yeah, you would've seen this system deployed. We did something really...kind of a first for Blue River this past year. In previous years, what we'd done is we'd built a machine and then we'd taken it to a grower's field, gotten a cooperating grower, and then we'd operated the machine and kind of done demos to get product feedback. That worked really well, but it's not the same as a customer actually the operating machine. This year, what we did is we actually handed the keys over — this brand new sprayer that's right off the line, has all the bells and whistles and a bunch of brand new technology — we handed them to growers and gave them a little lesson on how to run it and said, ""You actually run this machine and we're just going to sit on the road and kind of watch you do it, and we're not going to interfere with you."" We actually did give customers the ability to run the machine, and the learnings from that were really great. We're actually just sort of still compiling the learnings and bubbling the biggest things that we want to work on up to the top so we can hit the ground running again next year. Lukas: Wow. Were there any surprises when you did that? Chris: Yeah. Yeah. Big time. There was one that I think is really funny. So, on CVML, like computer vision machine learning, we tend to look at the world in a certain way. And that way is like, ""Okay, the most important thing for the machine to do is identify the weed and then spray the weed."" There's a pretty good reason for that, because what happens to a small weed? Well, it turns into a big weed and that becomes a problem. 
So, hitting weeds when they're small is something that we think we need to work on. We worked on that problem very diligently, and we've got a solution that definitely targets the smallest weeds. We do this with a little sensitivity knob on the model. What it's doing there is thresholding the focal loss in our network. We're kind of thresholding that value and then making a decision based on that threshold setting. As a user, you could target it. So, you could say, ""Okay, I want it to go really sensitive and target the smallest of the small weeds, or I can go less sensitive and only kind of care about the big weeds."" What we found our customers doing was kind of using this in ways that we hadn't envisioned. One of those ways that was a total surprise...I spent a bunch of time out in the field this summer working with customers and observing their workflows, and when I came back and said, ""Yeah, one of the favorite things to do is to set this thing to be really low sensitivity and then go after only the biggest weeds,"" and my team's heads explode. They're like, ""They're doing it wrong. That's wrong, they're missing weeds."" It was an interesting one, because as the farmers explained it to me, it made a lot more sense. What happens with row crops is when you have...just kind of picture a bare field and then picture putting these crops in rows and then picture the crop emerging. In that time, when you're actually growing a soybean crop, you do want to target all the weeds, because what will happen is those weeds will compete with your crop and they'll compete for nutrients and it'll reduce your yield. Going pretty aggressively after those weeds in the early — we talk about pre and post. Pre-applications are pre-planting and post are post planting So, the first time you get in after planting is really called your first post pass, if you will — it makes a lot of sense to be aggressive at that early time. But as the crop starts to grow, what happens is that the crop — if you're successful — the crop will grow faster than the weeds. Because of the spacing of which they're planted, they'll actually start to canopy over. At that point, then they've actually won largely. This is not a generally true statement, but it's almost generally true, that largely your crop has won when it's canopied over. Lukas: Interesting. Do you have computer vision people on your team that have deep knowledge about farming? I don't think...that might be a low overlap set of knowledge. Chris: Yeah, it definitely is. I guess that I have the most. I grew up in rural Saskatchewan and we had a small quarter section farm, and I used to ride horses and stuff when I was a kid. I think I've probably got the strongest farming background on the computer vision machine learning team. But what we tell people is that, hey, you don't have to know a lot about farming to come and work for us. And what you do have to do though, is not be afraid to go out to the field, because we believe that's where we learn the most. That's been part of kind of Blue River's DNA, I think, for forever essentially. We think, okay, you could go and talk about stuff on the whiteboard and you definitely should do that, but you need to reduce that idea to practice and get into the field as fast as possible so that you can learn.You can blow up your assumptions basically. That's one of our guiding principles, I guess, at Blue River. One of my friends...we don't like to hire house cats. 
What that means is like, if you're the sort of person that just likes to kind of sit in their office and work on their problem and not go out to the field, then this probably may not be the greatest place for you. So yeah, no farming knowledge required. Lukas: Nice. But I like it. Good customer empathy, I feel the same way with the Weights & Biases engineering team. Have there been any other surprises when you've taken these devices into the fields? Chris: Yeah, there was actually. If you're sitting in the cab, the cabs of these machines are just absolutely fantastic. You hop in the cab, and you're sitting in this chair that feels like a fighter jet. Part of the allure of the fighter jet is you've got this joystick that has all these buttons on it and the self-propelled sprayer's exactly the same. Without a word of a lie, there's like on the order of like 24, 25 buttons on this joystick. They all do something different. It's really fun to get in this thing and like, wow this is really, really cool. We also have a display that you can see. There's a couple of displays. There's one that's kind of sitting here and then one that's kind of up more at your eye level. So, there's two displays that you can look at and you can control the system through the displays. When we launched the product for customers, we thought that the driving factor would be killing weeds. We said, ""Hey, everybody, all the feedback we've heard is that people want to control their weeds. And that's the most important thing. Savings would be kind of second on that list."" That's kind of how we came into the season. Certainly some farmers are like that. And I think initially when they started the machine, when they started using the machine, that was their first concern. Like, ""Okay, I'm going to go with high sensitivity. I'm going to kill all my weeds so I get the same weed control as broadcast. And then any savings I get on that are going to be a bonus, but I'm not actually going after savings."" At least what I saw was that people's workflows kind of shifted when they realized that they could use this a little bit more like a scalpel instead of a sledgehammer. If you've got the ability to go in and do sort of targeted delivery to only the things that you care about, then that changes your workflow. What I saw that was most interesting is — back to this display — as you're going through the field, it has something we call the applied rate map. It's basically...it's a geospatial map and it shows you what the boom is doing in real time. You can actually see the sprays laid down on this map as you're going over it. It's like a real time measurement of weed pressure, if you will. I think what I was surprised by was that customers usually don't look at that display, because it's really boring. If you're just doing a broadcast application, then the applied rate is always the same. It's not really an interesting map other than, ""Is a sprayer on?"" Or not. But with See & Spray, it's actually a really interesting map, because you can see the patches coming down on each individual spray nozzle. What I saw that was really cool was growers were looking at that map and then they were looking outside and saying, ""Oh yeah, that makes sense. I know I have more weeds in this area of the field,"" and then they get through another area where it wasn't spraying as much, and ""Oh that makes sense.
I know that this area isn't this wet, so I don't have as many weeds."" That was really interesting, because their eyes were just kind of glued to this real time mechanism to see, ""What's my sprayer doing and what's my weed pressure like?"" That was really cool to see folks using that. Lukas: How much have your models improved? It sounds like they're over a threshold where it's useful. Do you still feel like there's a ways to go in terms of the quality of detection? Chris: Yeah, that's a great question. So, the models have improved dramatically year over year. Lukas: Wow. And that's mainly due to better data labeling? Chris: Yeah. There's kind of two parts to that. Getting smarter with our labels and labeling with a better workforce, which we talked about. Also being more targeted in what we label and really kind of preferring quality over quantity. When we got into this, a few years ago, we always had in the back of our mind, quantity is really important and specifically diversity is important. The way that we approach this collection of diversity is to try to collect data in every kind of growing condition we can get our hands on. It's interesting to see how much different the ground looks. When we're kind of talking about soybeans, we might have a picture in our minds of a soybean on dark soil and a really pristine kind of computer vision environment that you could train a model on in 10 minutes and do something. It turns out not to be true at all. There are so many confounding factors. One sort of visually confounding factor that's really interesting is, folks are really into no till planting. No till planting is exactly what it sounds like. You just sort of don't do the tillage step and you keep the cover crop that was there last year. So, let's say you're rotating corn and soybean. You might grow corn this year and then next year you plant soybeans, but you don't actually till it under. You just run your planter through the old dead corn. You have all these stalks sticking up and they're all dead, and then you've got some plants that are emerging that are alive and then you've got weeds. It's almost like the most confusing computer vision environment you could possibly imagine. That makes it really hard. We've been working really hard on beefing our models up to work in these different situations. Another really good one is just the soil color and the farming practices. In countries like Brazil, they actually don't really plant on 30-inch rows. They plant sometimes much, much denser than that. That means that you can't really put a sprayer between the rows anymore. So, they actually drive like 45 degrees across the row and kind of kill plants or run over plants with the sprayer. Another situation that our model has to handle is these different farming practices in different regions. Lukas: I guess, as an aside, but how do they then do other... don't they always have to drive some machine over their fields? Why would they put this within...why would they put them so close together that they can't drive machines over the fields without squashing plants? Chris: It's a little bit of a mix of different types of machines. Lukas: Oh I see. Chris: Yeah. Sometimes what we see is that...the farming practices kind of in the US, someone described this to me as like, ""How would you build a factory?"" Well, what you do is you'd mechanize every part of the operation and you'd build machines that do this over and over and over again. 
You can think of farming the same way, except basically the factory goes to the plants, not the other way around. Excluding kind of vertical farming, which is a different thing. When you have sort of the John Deere, if you have a full John Deere stack for all this, then you don't really have this problem. You can plant with a compatible spacing with your planter and your weeder and your harvester, all sort of compatible. But when you mix and match and you don't have that kind of end-to-end solution. You sort of do end up into an interesting area where you've got to make some decisions. Brazil is a special climate too. It's a lot more humid and they actually kind of grow year round. It's a really interesting one. Soybeans in that kind of environment love to be way dense. So, it's kind of an interesting one, but yeah, that's one of the challenges we have with computer vision of teaching your model to identify weeds really at any kind of orientation. Lukas: Yeah. I'm seeing why this is a harder problem than I was imagining. These are really evocative examples. Lukas: Does it ever happen that your model has an escape valve or something, or it has some sense that, ""You know what? This is too hard. Red alert, I don't want to touch any plants""? Chris: Yeah. We have exactly this system. It's really, in a nutshell, how we respond to dust. So, you can imagine that dust is a really complicated environment, because there's really two things working against you. Number one, it's almost impossible to get good quality labels on a dusty image. If people can't tell what's in the image, then it's going to be really hard to get a machine to do that too. So, labels are tough to do. But the other thing that I like to talk about is these models often — and there's research, I think, to back this up. I'm trying to remember the name of the paper that came out, but I think it's called like confidently incorrect — but these machine learning models can be very confidently incorrect. A really stupid example is if we have an elephant versus giraffe classifier and you show it a rhino, it's going to be confidently incorrect, just kind of by definition. What we've done is we've actually trained...we sort of have an architecture that's a little bit like the Tesla model, where we've got this backbone and then these heads that do different things. We have this image quality head that tells us our dust probability, and we've trained it to really detect the presence of dust. Once that's above a certain threshold, then we say, ""Ah, probably the results from the model are not to be trusted in this scenario. The model might be confidently incorrect."" And then what we do is — in our system it's pretty easy, we have what's called a fallback — so we fall back to broadcast. The idea is if you're not sure what it is in your model, you don't trust the results of the model because it's dusty, then you just turn the sprayer on and that way you make sure you're not missing weeds. Lukas: Oh, interesting. So, you're really doing multitask learning. Is that right? Chris: Yeah, that's right. We have a few other kind of image quality related heads that's in our architecture and we're always kind of thinking about adding more as we discover situations that we need to detect. A good example is implement occlusion. If part of the implement gets into the camera frame, we want to be able to detect that. 
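To make the shared-backbone idea concrete, here is a minimal PyTorch sketch. It is an illustrative guess, not Blue River's architecture: a toy encoder, a crop-vs-weed head, and a dust head whose probability gates a fallback to broadcast spraying. The layer sizes and thresholds are arbitrary assumptions.

    # Minimal sketch of the shared-backbone, multi-head pattern described above.
    import torch
    import torch.nn as nn

    class SprayNet(nn.Module):
        def __init__(self, num_classes: int = 2):
            super().__init__()
            self.backbone = nn.Sequential(                      # stand-in encoder
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.weed_head = nn.Linear(32, num_classes)         # crop vs weed
            self.dust_head = nn.Linear(32, 1)                   # image-quality score

        def forward(self, x):
            feats = self.backbone(x)
            return self.weed_head(feats), torch.sigmoid(self.dust_head(feats))

    def decide(model, image, dust_threshold=0.5, weed_threshold=0.5):
        weed_logits, dust_prob = model(image)
        if dust_prob.item() > dust_threshold:
            return 'broadcast'          # image quality too low; don't trust the model
        weed_prob = torch.softmax(weed_logits, dim=-1)[0, 1].item()
        return 'spray' if weed_prob > weed_threshold else 'skip'

    # model = SprayNet(); decide(model, torch.randn(1, 3, 224, 224))

Because the extra head reuses the encoder's features, the added inference cost is tiny, which is the efficiency point Chris makes next.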
Doing that as a kind of a detection head off of an encoder backbone is a really very efficient way to do that, because you don't pay much of a runtime. I think we pay — I don't know, the numbers aren't quite in my head — but I think we pay like much less than a millisecond run time for doing a dust classification. Lukas: How frequently do you update these models? Can you update the model on one of these devices? Would you do that? Chris: Yeah, we do do that. It's going to be up to the farmer, ultimately, how often they update. Really what's going to dictate that is their connectivity. Some of the farmers that we worked with had really good connectivity and we could push updates to the machine very frequently if we wanted to. Other farmers that we worked with, they're in very remote areas of sort of west Texas and there are no bars. You have to go 20, 30 miles away before you get to one 4G bar. I think what's going to happen is we're going to see that the folks that have connectivity are going to be updating more often and they're going to be getting kind of the latest and the greatest models. The folks that are not are going to be just updating much, much less often. Lukas: Are you able to use the data that the cameras collect in the fields to improve the models? Chris: Yeah, yeah, we do that. Just like every other company that has a bunch of sensors, you quickly figure out like, okay, well we can't record all the time. Our solution to this is that we do some sort of very sparse triggered recording, as folks are going through the field. Within minutes of them going through field, subject to the connectivity constraint that I talked about, the machines are uploading data so that we can validate the performance of that data. We like to think of this ML flywheel concept, where...Andrew Ng calls it the virtuous cycle of AI, where you train a model, it does some predictions, you evaluate those predictions, train more models, and your model gets better and better. We have folks sign a data use agreement that allows us to do this, but this data is a gold mine for us in terms of finding edge conditions in the model. When you're training machine learning models, you quickly start breaking champagne after you get X number of images in a model, because you're sort of getting to this point of diminishing returns pretty fast. But that doesn't mean your model's working well. That means your model is generally okay. But it might actually really suck in some situations. So what our metric is, is how many fields are you actually passing our spec on. Instead of aggregating all fields together into a single number we break them out on a per field basis. And our goal is to always drive that number higher. It really comes into exception management, where you very quickly reach this point where the model's actually doing pretty well, but then it has these notable failures and now your attention shifts to detecting and then addressing these failures. That's what the ultra sparse logging helps us to do, because we can really just get the data from the customers in the fields they're actually trying to work on and we can improve their models. Lukas: You guys are really doing a hard real world application of ML on massive data sizes and you've been doing it for long enough to really be battle hardened, I guess. I'm really dying to know, what does your ML stack look like? How did you get to it? How did you decide on your ML framework? What are the other tools that are really important to you? 
Can you talk a little bit about that? Chris: Oh yeah, sure. It's been a really kind of a fun journey. Going back to 2016, we were training our first models in Caffe, because that was the...nobody's heard of Caffe these days, but boy, that was like the tool that everybody used back in 2016. It was way farther ahead than say TensorFlow at the time. We started with Caffe and we built a system out in Caffe and then we saw some interesting things going with TensorFlow and I suspect this will be kind of a familiar story to hopefully a lot of your listeners. We moved to TensorFlow in sort of 2018 and then we saw, hey, PyTorch is getting really interesting. So, we moved to PyTorch about a year ago. We've been doing most of our work in PyTorch now. Lukas: What caused you to move from TensorFlow to PyTorch? Was it a feature that mattered? What was the driving reason? Chris: It was really adoption. So, we found...our experience with TensorFlow was really before the eager interface. So yeah, prior to eager. Lukas: So, this is more than a year ago probably. Chris: Yeah. Yeah, I think it was like 2018 when we were doing PyTorch. It was really before the eager interface and our folks were finding...it follows the papers a little bit too. When folks write a lot of papers and if you go to Papers With Code and you can download the PyTorch thing, that's just a much lower startup cost. So, to some sense it follows that. Lukas: Interesting. What else? Do you guys do hyperparameter optimization? Is that important to you? Do you use like a data store for this? What are you using? Chris: Yeah, so in terms of our training stack, PyTorch is kind of our main tool. We have done a little bit of hyperparameter search. One of the things that's a big challenge for us is that at some level we're stuck between this rock and a hard place. The rock is the accuracy — we want the highest accuracy —but the hard place is also, we need to make this run as fast as possible. In our system, the speed at which the farmer could drive the machine is directly gated by how fast our inference goes. We have to take great pains to try to pick and engineer a network architecture that's going to satisfy our accuracy, but run in the time constraints that we need. That actually dictates a lot of our network architecture choices. In terms of our kind of deployment stack, what we...oh, so back to hyper parameters, what that means is that we actually don't have a tremendous number of hyperparameters to search for. It's really just sort of learning rate, learning rate scheduler, and a handful of other things. We're sort of, by definition...our search space on architecture isn't very large. Some of the things that you would normally do with hyperparameters, we haven't found them to be all that useful for our problem. Lukas: This just came to me, but it seems like a self-driving car is hard, but a self-driving tractor might be pretty easy, especially if someone's just looking at a console instead of the field. Why does a human have to drive the tractor if you can do these really smart things to figure out where the weeds are? Chris: Yeah. You're right on the money there, Lukas, as usual. If you think about, what are the challenges of an auto driving car, they're really large. Just to pick one example, let's say you're cruising along the freeway and you're going a hundred kilometers an hour — I'm from Canada. So, that's why I think of kilometers — you're going a hundred kilometers an hour and then something happens and you need to slam on the brakes. 
Is that the right thing to do on the freeway? The answer is, well, maybe, I don't know. It depends what the problem is. With tractor automation, it's actually fine, because your field isn't full of tractors going a hundred kilometers an hour, and nobody's going to sideswipe you with another tractor. If you see something weird, you could just stop and it's totally fine. You're right on the money. And it's geofenced, so you could say, okay, well, I'm just going to allow myself to be automated in this area. And there's a lot of it. Yeah, definitely a lot of advantages in terms of our problem space for automation. Huge number of advantages. Lukas: Is that something that you might work on or... Chris: You did see that John Deere acquired Bear Flag recently. So, I think you can probably guess that John Deere is very interested in this. I think some more information will be coming. Lukas: Awesome. Are there any other farming applications that you're really excited about? Chris: Yeah. We view weeding as almost the beachhead project. I've got a background in astronomy, so this metaphor kind of makes sense to me. Like, when you build a new telescope that has maybe different modalities or different resolution, all of a sudden you get this data from this new telescope, and then you start answering questions you never even thought to ask before. I kind of feel like the same thing's going to happen with our system. The fact that you have cameras that are three feet away from plants taking pictures at a high rate of speed and high resolution pictures is really going to open our eyes and change the things that we're doing with the data. You can actually see the weeds evolve as you spray them. You can go in today, do your sprays. You can look at that map and say, ""Okay where did I actually spray the most? And which weeds am I concerned about?"" And you can plan your next workflow based on that data. Lukas: Cool. Well, we always end with two open-ended questions and one doesn't necessarily even need to be about farming. It's kind of your take on what's a topic in machine learning that you think is underappreciated or something that you'd like to dig more into if you had more time in your life. Chris: Oh yeah. I did put a little bit of thought on this one. I think a really underrated aspect of machine learning, there was a paper in, I think it was 2013, it was called Technical Debt in Machine Learning Systems. Lukas: Oh yeah. Chris: It was by Google. Lukas: Google. Chris: Yeah. You know this paper. Yeah. Lukas: Yeah. Chris: The thing that really jumped to me on this paper is, when you think about the whole machine learning pipeline, sort of like 20% of it is the cool, sexy stuff that everybody wants to do. Network design and hyperparameters and all that fun stuff, training on thousands of GPUs. That's all fun and very necessary, but it's about sort of 20% of the problem. Really, if you don't have a really strong data infrastructure around that, then you actually can't do that part, that 20% part. So, you can't make your product. Having an excellent data pipeline, I think is kind of underrated. It's something that everybody, I think underestimates when they get into machine learning. It's like, ""Okay, all I got to do is take this TensorFlow tutorial and do MNIST and okay, awesome. Now it's just like fine tune a model and I'm off to the races."" If you've got an awesome data pipeline, then that's kind of largely true.
But if you don't, then you've got to focus your efforts on building that data pipeline, because that's really probably the most underrated problem in building ML systems in my mind. Lukas: Do you have any specific suggestions for someone putting together a data pipeline from approach to even what software they might look at first? Chris: Yeah, that's a great question. I'll kind of draw a parallel to labeling. In 2016 we wanted to do labeling on images and we couldn't really find what we wanted. We kind of had to build our own system based on MTurk to do that. That made sense at the time, because we couldn't sort of look at the marketplace and pick the thing that looked good and then go with it. Then in 2018, we said, ""Okay, well, this exists now, so we could just do this."" And that's kind of how we met you with your old company. What I'm seeing is the startup world is afire with MLOps solutions. Folks have figured out, wow, these problems are hard and they're hairy and they're nasty. I would encourage folks when they're looking at a data pipeline to survey the space. There's new offerings every day out there. I think you'd be doing yourself a disservice if you didn't at least evaluate them before you embark on your data pipeline journey. I don't know that there's a solution out there that's going to do everything that you want, but there could be something that gets you pretty far, and then you can add your own plugins or something to that effect. But I would say if you're sort of like clean slate, how do I get started? Well, you should definitely look at some of the software packages that are out there, because this is not really a problem that you necessarily want to solve, unless you absolutely have to. Lukas: Well, as someone that makes one of these software packages, I strongly agree with that. Chris: Yeah. I can imagine. Lukas: I guess that's a good segue to our final question, which is...you guys have had one of the longest journeys of imagining a solution and then actually getting it working in production. I mean, what's it been like? Eight or nine years maybe, for that full cycle. Where have the biggest bottlenecks or the most surprising bottlenecks been? Chris: Yeah, that's a really, that's a deep one for sure. Blue River's kind of first computer vision product was the lettuce thinning machine. That operated from 2013 to 2016 in the Salinas Valley and Yuma, Arizona. And then we kind of pivoted to do row crops. That's kind of when I got involved in the company, in 2016, working on row crops. Yeah, our journey has kind of been from 2016 to roughly now. The things that are really, I think, hard is when you're talking about building a piece of hardware and a piece of software and ML that work together, that's a really, really hard thing to do. I think there's definitely a huge difference between kind of building...being really scrappy and building a prototype that is going to work out in the field and get customer feedback. That is a tremendously different proposition from scaling these things out to thousands of machines. I know with John Deere, that's a huge jump. I think that's kind of like scaling. I was looking at a book, just yesterday, written by Reid Nelson. I think the title was ""Scaling"". Achieving scale is really a big challenge.
I think it's even more complicated when you've got kind of hardware in the loop, because now instead of just having the software portion or just the ML portion, now you've actually got hardware which necessarily takes a much longer cycle time to improve or fix. I think the biggest challenge is marrying the two, software and the hardware together, and getting something that drives customer value. That's been the biggest challenge for sure. Lukas: Well, congratulations on making such an amazing product that helps farmers and helps the world. It's great to talk to you about it. Chris: Yeah. Thanks a lot. We definitely are excited by this product. What I'm personally really excited about is, as we scale this thing out to tens of machines and hundreds of machines and thousands of machines, those savings are going to go up proportionally. I've been talking to John Deere about this. I want to get a ""Gallons of herbicide saved"" on the John Deere website. And it's going to just like keep increasing, going like to this really big number. It's going to be like the national debt, but it'll be like a good number. Lukas: Awesome. Well, let me know when that happens. I'll take a look and feel a little bit of pressure too. Chris: Yeah. I'll give you the link. Lukas: Awesome. Excellent. Thanks for your time. Chris: Okay. Thanks, Lukas. Take care. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So, check it out. Lukas: I guess this doesn't have much to do with machine learning, but it's my podcast so I'm going to ask it anyway. If there was a transporter, like a Star Trek-style transporter that would disintegrate your body and rematerialize it somewhere else, does that seem like a safe thing to use? Would you get into that and use it to transport yourself? Chris: Oh, that's a good one. The answer I'd love to tell you is like, ""Oh yeah, awesome. That would be amazing."" But yeah, I would never get into that. Just like I would never jump out of a plane. Sometimes you learn things about yourself and the fact is that, yeah, I'm not going to jump out of a plane and I think I'm not going to be an early adopter of the Starship transporter. So, yeah. Sorry, two answers. Lukas: Would you be a late adopter? If everyone else was using it, do you think that would convince you? Chris: I think so, yeah. I definitely think I would be...it would have so many advantages that I think it would be almost irresistible not to use it. But yeah, I don't see myself as one of the first people jumping into the transporter. But certainly it would save us so much time and we could just get to London for a meeting like that and then come back. That would be super awesome. Lukas: You wouldn't be concerned though, that it's a different person who shows up in London, even though they act like you and look like you? Chris: Yeah. Well, it's interesting I haven't...it's been a while since I've been... I used to do a fair amount of quantum physics and I haven't done very much lately in my current job, but as far as I can remember, there's not really a bound on how your molecules get reassociated. There's definitely entropy and energy loss in that process. So, I think there are definitely some challenges for sure. And we'll have to see. 
That'll definitely be a good test, maybe you have to pass some kind of a test when you get out of the transporter to see if you're still you. Lukas: Interesting. All right. Okay, here's another sort of fun one that we've been asking guests. What do you think about the singularity? Do you imagine a world where ML gets smarter than us in kind of every way, and we just stop working? Does that seem likely or unlikely to you? Chris: Oh yeah. That's such a great question. I think my thoughts on this have been kind of formed by the Minsky school of thought to some extent. It's interesting that as we get into a more connected society, you can kind of think of intelligence as these nodes that are connected, and information gets shared between the nodes. I think one thing that's sort of really interesting about that, is that as we put more and more nodes into the network — whether it be sort of Facebook or social media or a computer — as we add more and more nodes we're actually not seeing intelligence emerge out of that. We're in some sense in some of the social media sites, we're seeing the opposite. Sort of like an anti-intelligence almost emerging. It is sort of an interesting thing, because I don't think anyone would've predicted that 10 years ago that we'd be having these problems all of a sudden. When you put everybody in communication, we have all these problems that kind of popped out of that. I don't think anybody would've predicted that. Certainly I didn't. In terms of a sentient computer that's going to take over everything. I think we're a ways away from that. Is it a possibility in the future? Yeah, absolutely. How close are we to that? It's pretty tough to say. I know that with training computer vision models, these things are pretty far from AGI. The teams that I work with and the folks that I interact with are under no illusions that this thing is anything other than a specially trained model to do a certain task. So, they're nowhere close to what I would sort of describe as AGI. I do think that we have a long way to go before we really have to worry about a Skynet sort of thing, taking over the world. I don't think we're anywhere close to that by any means.",9161 +"Kathryn Hume — Financial Models, ML, and 17th-Century Philosophy",https://www.youtube.com/watch?v=Az0qfCfPcaI,3128,2021-12-16,"Kathryn: We would love to go in, have the business partner say, ""Here's my task, here's my baseline performance. Let's see if this is viable. If you can increase performance upon this baseline by X%,"" or whatever. I don't think I've ever seen an instance where that happens in real life. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Kathryn Hume runs Borealis AI, which is the AI arm of the Royal Bank of Canada. She works on a slew of machine learning applications that we get into. She has a background in comparative literature and speaks Latin, and that's surprisingly relevant to the work that she does and to our conversation, and to find out why you're going to need to listen to this one. Lukas: All right. Why don't we start with what you work on? You seem like you have like a really interesting job in an interesting organization. Maybe you could describe it and talk about what a day in the life is like? Kathryn: Yeah, for sure. So I lead up a group called Borealis AI, which is the machine learning research lab for the Royal Bank of Canada. 
For those listening in the States or outside of Canada, you might not know of it, but it's actually the largest bank in Canada, and I think it's the ninth largest bank in the world, so it's a pretty big shop. There's 90,000 employees in the company. Borealis was founded in 2016 as just the ML research center for the bank. Day to day in my team, I think like many other ML shops, we learned over the years that it takes more than just scientists to really make production ML systems work. We've got a good group of machine learning scientists. We have ML engineers who do a lot of the work in building...taking the code from the scientists and really building out production ML systems. We have product managers who do what product managers do, figure out what we should build and really collaborate to make sure we ship things on time. And then we have a group of business development experts who work with our business partners. In the bank there's mini markets, if you will. There's the retail bank, which works with people like you and I, so checking accounts, savings accounts. There's wealth managers who help manage people's assets. And then there's capital markets, which is sort of the institutional investing. They partner with those various teams and help us find ML use cases. Lukas: Could you describe what those use cases are? Kathryn: Yeah, for sure. I'll talk about some of the products that we've worked on and various use cases that we see. So it's a broad variety, I'd say, of applications. If we go into the retail bank, things that are helpful for people like you and I, one of our recent applications was a cashflow forecasting application for day-to-day customers. Probably there's lots of customers out there who have missed a bill payment or potentially overdrawn their account and gotten fees from the bank for having insufficient funds in their account. We built something that's trying to predict upcoming payments in the next seven days. We stuck to a target of about a week out to give people reminders that say, ""Hey, this thing is coming. You might want to either pay it now or take different kinds of actions in managing, moving money from savings to checking, et cetera, to be able to cover those expenses."" Lukas: Wow. Is that really live? That really works? Kathryn: Yeah, it's live. Lukas: I've never gotten a message like that from my bank. Maybe I should switch banks. Kathryn: It's live as of...maybe a month ago we went into production? So it's really new. Lukas: Wow. How cool. Kathryn: But, yeah. That's one of the latest ones we put out, and it's an interesting ML problem because these...if we think about ML models, they're going to take a series of time series data, things that you've paid in the past and then try to generalize a prediction, ""What's that going to look like in the future?"" Lukas: But if you have something, say it's your electricity bill or your phone bill...my phone provider decides every once in a while that they're going to increase my rate arbitrarily, and I don't know why this is coming, but it just happens. And it goes from $75 a month to $85 a month, and I have to just pay that extra fee. Kathryn: One of the cool things in our model was, we needed to use an attention mechanism, so as to basically not overgeneralize that trend, right? It's not going to see that there's... It's 75, 75, 75, 85, 85, and then imagine that three months later, it's going to go to 95. 
We had to sort of correct for those kind of stepwise changes that a provider might make, that an algorithm might incorrectly generalize as a trend. It was one of the many micro nuances to actually making this thing work. Lukas: That's cool. Or your energy bill might go up in the winter or I guess in the summer if you're in a hot place, right? Kathryn: Yeah. There's a lot with seasonality for sure. Yeah. You can't just take the average trend over a year or else you're going to end up with... I don't know. Somewhere in the midpoint in September when it's actually... It would tend to be more stepwise. Yeah. Lukas: Can you talk about how you came to make that? I mean, I guess I always assumed that banks kind of liked charging really high overdraft fees and would want you to maybe do that, but I guess that's wrong. How did you kind of come up with that as...working with the business to know the business wants that, and then also realizing that's a feasible ML problem that you could actually solve? Kathryn: I was actually quite proud of this. At the Royal Bank of Canada, there's a little bit of...it might have to do with the Canadian banking mindset, but there's...part of it is be a profitable institution, but then equal in part is, ""Be a good citizen, be good to the Canadian citizens."" Slightly different than in some of the US environment in that the Canadian population is very...it's not underbanked. ""Underbank"" is a term that we use for people who are not within the recognized banking system, so a Chase customer or a Bank of America customer, or a Morgan Stanley customer. I think in the US, I don't know exactly what the statistics are, but it's something like 30-ish percent of the population is actually not using a bank, a registered bank. They use things like payday loans and sort of the sort of on- the-side type banking products. But in Canada, I think it's 98% of the population actually is in the main banking community, which then has implications for sort of social responsibility from these pretty large institutions because you've got the whole population represented. I found it quite promising that the executives...the business executives making the decisions felt that it would be better for customer loyalty to provide a service that helped versus do this sort of nickel-and-diming on these kind of fees. Finding the use case, it's always an iteration. I think in the ideal, coming from the academic ML world, we would love to go in, have the business partner say, ""Here's my task, here's my baseline performance. Let's see if this is viable. If you can increase performance upon this baseline by X% or whatever. I don't think I've ever seen an instance where that happens in real life. It's more like, ""Hey, we have this idea. What do you guys think?"" And then we're like, ""All right. Let's play around with some of the data and see what we can find,"" and see if there's a there, there. We'll come and we'll say, ""All right, we think this is the task,"" and then it's like, ""What timeframe do you need?"" When we originally started this, I thought cashflow was six months out. Then our partners were like, ""No, no, no. One week."" And I was like, ""Okay. Well, that helped."" So there was sort of iterations on just narrowing down the scope of the prediction, what qualified as decent performance. I often find with the business, the preference is it's accurate every time. Lukas: That would be the preference. Kathryn: Yeah. 
The preference is always, ""Actually there's no machine learning involved and this is just a rules-based system that works like clockwork."" So then it's sort of iterations where it's, ""All right. Well what if this edge case, the prediction is off? How might that impact customer experience?"", this sort of iterative negotiation to get to the point of, ""Yeah. We're comfortable with this as a starting point."" Then there is the selling and pitching and telling the story, getting the various people involved to get it to market. I'd say with our group too — since we're a machine learning team, but we partner with other groups in the bank that do design, that do just a lot of those sort of ticking the boxes around all the business processes — there's a lot of back-and-forth in stakeholder management to really get something live as well. Lukas: What ended up being the level of accuracy that you could get with that sort of seven-day prediction if I'm going to overcharge my account? Kathryn: It varied per payment type. If you had a pre-approved payment, like your Spotify subscription or whatever, it was quite high accuracy. Also the day, the day that the payment would come out was quite high. We had a multi-objective supervised learning algorithm that predicted...the one task was ""How much?"" And the second task was ""When?"" With those, it could get pretty high. I think within three days range we were at...I don't even remember. I think it goes down to 88 or 89% if it was...no, sorry. Within three days, it was up to 98. And if it were the exact day, it was more like 88, 89. So with those pre-approved, it was quite high. Once we got into things like loan payments, anything that's sort of an arbitrary e-transfer, like a Venmo payment, those are harder because there's just not a lot of predictive...there's more variability. There's not that kind of just standardization. So, yeah. It varied per payment type, but sometimes it was as high as 98 and then it would skew down to about in the high 80s. Lukas: You were using attention, a model that included attention? It sounds like a lot of machinery for this kind of problem. Did that really matter? Did it do much better than a simpler baseline? Kathryn: Yeah. It's a great question. I posed the same question to the team, being like, ""Wow, this is a lot of machinery for this kind of problem."" I think it came down to, again, the seasonal variability. You'd imagine that it's something where it's...it seems like it could just be a pretty standard approach. But, just with these things that creep up, like seasons, like variation and payment time per individual...some people have set up automated stuff and the minute it goes, it comes. Other people, they haven't, so they do it manually and there will be these lags in when they pay so that has implications not only on this thing is due, but how that impacts their balance. Once you get into the details, it becomes more messy and needs more machinery. Lukas: Can you describe some of the other applications? That seems like such a surprising and cool and one but what are kind of the main bread and butter bank applications of ML? Kathryn: I want to talk about another cool one. Lukas: Oh, yeah. Kathryn: I want to do another cool one. I'll do a cool one first. This one is really artful because you have to scope it down really small, but it's cool. It's using reinforcement learning for trade execution. Here's the problem. Imagine you're a big hedge fund and you trade every day, right? 
You just come into the equities markets and you order millions of orders of some sort of stock and you trade it through the day. The question at hand is, you come in, you decide that you want to execute a million orders of Google shares over the course of the day. The stock market opens at nine, closes at 4:30. The question is, ""How do you distribute that order optimally throughout the day so as to achieve your desired returns targets?"" There's a common historical algorithmic approach to solving this problem, which is called a VWAP algorithm. VWAP stands for Volume Weighted Average Price, it's the average price of the stock throughout the day, weighted by the volume that's traded, as the name suggests. I don't know when these kind of algorithms came into being, but I think they date back to the 90s or something, so it's- Lukas: Sorry to interrupt it. That seems like it would be a number. How is that an algorithm? Kathryn: It's a curve. It's a number, but it's a number...if you trace it throughout the day, it will change slightly, right? The algorithm that we took was, ""What trades do you place to hit that number?"" Lukas: Oh, I see. Kathryn: I won't go into the complexities of how the limit order system works, but effectively think, ""Do I sell, hold or buy the stock?"" It's not exactly that, but I think for the sake of...we don't have to go into this arcane detail. You could buy or sell or hold, right? So this is where reinforcement learning comes in. You have a sequence of a couple of kinds of actions. Lukas: Okay. Kathryn: You might sell, you might buy, but yeah, you're trading stocks throughout the day. Lukas: What would the simple algorithm be? Because you don't know what the stock price is going to be, right? Kathryn: You don't know what the stock price is going to be, but the simple algorithm, from what I understand, is you've got historical price curves. How the stock performed yesterday, two weeks ago, a month ago, et cetera. You'll use that to make a guess on how much you should buy at a certain time of the day in order to achieve your target goals, the target money-making goals that you have for the day. The algorithm basically releases – buys — or holds, right? Or doesn't release. And it partitions that at timestamps, ""At 9:00 AM, I'm going to do X, at 9:15."" What you don't want to do is — say if it's a large order, a million orders of stock — if at 9:00 AM you say, ""Buy a million,"" that has a big impact on the market price and it will sort of shake things off, right? Lukas: I see. Kathryn: You try to dose it. Just a little bit at a time, without disrupting the ship, right? With keeping the market relatively stable. Because you're one of the participants, but there's going to be however many others on the exchange at that time. So what we did here is we said...we were trying to hug that number, the price number across these time stamps. What we said is, ""Can we use reinforcement learning to optimally distribute, dose our buy decisions to stay as close as possible to that curve?"" The value function was basically just, minimize the distance. We can observe that number. Lukas: Sorry. The curve is the volume? Kathryn: The curve is the average price weighted per volume. Lukas: It's the price weighted by the volume. Okay. And you want the curve to do what? Kathryn: What you want to do, there's a trading strategy where you want to hit that curve. You want to sell or buy in such a way that the price- Lukas: It matches. Kathryn: -that you arrive at matches that curve. Exactly. 
Lukas: Does it work? Kathryn: Yeah. The cool thing is it works quite well, even when there's a lot of volatility in the market. In March of 2020, when COVID hit, it was just much more volatile than the stock market normally was. What was nice is, it adapted. I don't know the exact time it took to adapt and nonetheless get superior trading returns. It may have taken a day. It may have taken a couple of weeks. I'm not sure on that timeframe, but I know that it adapted much better than a standard trading algorithm would. Lukas: I guess you're constantly retraining the algorithm then? Kathryn: Constantly retraining the algorithm. A different team — not our team now, but the team that now owns that algorithm — are working on adapting the task to different kinds of trading styles. Not this ""Hug that VWAP curve"" that I described, but there's other approaches and strategies that one could take when trading, and they're retuning it to see if it could work there. There's a lesson though in reinforcement learning and that it's not...you can't just scale it to a new use case. It requires significant effort to write a new algorithm that will work with a different task. Lukas: Right, right. Interesting. What are the other kind of important applications to you? Kathryn: Yeah. So other, more bread and butter applications. Lukas: Or other cool ones, I guess if you've got other ones you want to talk about. Kathryn: There's another cool one we could talk about down the line that's that's a little less related to banking. I like to think about it this way. What does a bank do? A bank takes in money at one rate — you put money in your checking and savings account — and it loans out money at a different interest rate and it makes money on the spread, right? That's kind of basically what a bank does. Historically, when banks have used models, statistical models, to decide how much they should lend to a given customer, there's plenty of background models that are using linear regressions, et cetera, to do this. But there's a lot of opportunities to upgrade some of those decisions using ML with more data, different types of data. But the basics of, ""Who should we give a loan to? How much should that loan be? What is the risk that the person will...that we incur that the person might default on this loan? If they do default, when should we call them?"" Ranking that ordered queue. There's process optimization. We have our call center. Very often, we go into... especially today, we have digital banking. You go on, your password doesn't work. Something happens, you're stuck. You have to call somebody. There's a lot of applications in call center automation. The conversational AI work automating some of that queue, rank ordering queues. Whose call should we take first to approach this? Banks often have a series of products. Which product offering do we send to which customer next? Those are kind of more standard industry problems. I think that exists everywhere, it's not unique to banking. It's sort of the next-best product offering optimization. Lukas: Do you work on all of those problems? You and your team? Kathryn: We don't work on all of those problems. No. We do a lot of work in credit, but there's other teams in the bank who work on various other data science problems like this. Lukas: Well, tell me about credit. 
That's something I don't know a lot about, but I do know that we've had various sort of mortgage crises that I think were, at least...the publicly available information seemed to indicate that it was too much machinery leading to bad decisions. Do you think that's accurate? How do you think about machine learning in that context? Kathryn: I'll speak on this from the perspective of somebody who was not a banking expert in the 2007, 2008 credit crisis. This whole collateralized debt obligations, right? There's the tip of the spear, which is two people deciding to...let's say, I decide to lend you 10 bucks, and I think about whether or not you're going to give that back to me. And I say, meh. If he doesn't pay it, it's also okay. All the way to, ""I've got a set of mortgages and I'm a different institution who is going to hedge a strategy on some other institution's mortgages, and I don't have insight to the quality of them."" That's when you get into these sort of layered risk management strategies. In terms of ML's engagement here, I think the big thing is the regulators have caught up. There's always...some action happens and then activities, and then there's been a lot of regulatory oversight since 2007, 2008, to try to protect the economy by putting some limits on banks. There's a thing in Canada, we call it the CET1 ratio, which is a ratio that's used to manage the liquidity that a bank has, this sort of overall cashflow over risk-weighted assets. These are something like a mortgage, right? The asset that a bank might hold that has some risk associated with it. A bank has to manage that ratio in such a way that if something bad happens, there's still relative stability. If you think about adding ML into this mix...let's say we were to use ML to calculate the risk factor in the denominator of that one equation. They want a lot of transparency and explainability, right? There's a lot of governance oversight that's like, ""We're not just going to put in a black box neural network and see what happens."" There's a high need to select models for those kinds of use cases that are quite transparent and auditable, and where you can clearly understand how an input feature leads to an output. Lukas: What kinds of models do you end up using? Kathryn: It varies per use case and context. Ranging from the cashflow one that I talked about is a deep LSTM. There's an LSTM backbone also in the reinforcement learning one for trade execution. Two, sometimes. I'm not sure if my team has done much, but there's a lot of decision trees. There's a lot of XGBoost models for some of the credit work. We have a governance tool that we've built that is optimized for decision trees, because there's a lot of models in the bank that use those. Lukas: This is a single tree or a boosted set of trees? Kathryn: It varies per use case again. Lukas: It seems like probably a lot of applications, but mortgages specifically has kind of a long history of at least racial inequality. How do you think about that? Are you able to look at the models and get some sense if they're being fair? How do you even define what fairness would mean? Kathryn: Yeah, great question. We haven't done any work on mortgage predictions in particular, but we have done some work with credit and we do fairness. There's a lot of fairness tests prior to putting a model into production. At the bank, there's a group called Enterprise Model Risk Management, and there is...it's interesting. 
I don't actually know if there's a preference for individual- or group-level fairness testing. I do know that there is a tool we've built that focuses on individual fairness. Lukas: Sorry, what would that mean? Individual fairness versus group fairness? Kathryn: If you've got two groups where a group is defined by some similarity on a feature — let's take the example of race — so you've got the black group and the white group. The group level fairness is going to be, ""Is the error rate on the black group proportionate to the error rate on the white group for some prediction task?"" If you go into individual-level fairness, if you have a set of features that are similar to my set of features, then if I get a $5,000 loan, you too get a $5,000 loan. So we have tools, but I still believe there's a decent amount of subjective interpretation that goes into, ""What aspects are we trying to calibrate as 'fair'?"" Lukas: Yeah. I mean, sometimes it seems to me with machine learning, it forces us to be more clear about what we mean by fairness and that can...just the way it's easier to kind of quantify the unfairness sort of leads to a lot of debate, right? I mean, how do you account for features that are correlated with group fairness, right? It seems always challenging. What does it mean to really prove that your model is being completely fair? It seems like a hard thing to rigorously define, although I'm sure a lot of...I mean, we should get people who have thought about it deeply on this podcast, for sure. Kathryn: The last I checked, there were 21 current interpretations, technical interpretations of what fair means. Lukas: Is there a list somewhere? Do you have a link that you could give us? Kathryn: Yeah, I can definitely find it. There's a paper from..this is from 2019, so maybe it's been...or something like that. But, yeah. I can definitely send a link after this call. I've seen things like...at the bank, there was one where these proxy correlations, you might want to say we don't want to discriminate by gender. It was one for a business loan, but they kept in the business code type. It was restaurant, retail, manufacturing, blah, blah, blah. And one of them was beauty and spas. As it happens, some very high percentage of the proportion in Ontario that are beauty and spa owners are women, so there's this proxy encoded. They sneak up all of the time, right? If you really dig into it, you can keep going and uncover these potentially unfair variables, so. Lukas: Interesting. We found some other interesting applications that you've talked about or your team's talked about, like a text-to-SQL database interface. Would you want to talk about that at all? Kathryn: We built this tool called...we called it ALANN, which was ""A Listening Answering Neural Network,"" I think is what the acronym stood for in the beginning. It's a text-to-SQL interface. Basically, user comes in, poses a question like, ""Find the highest-rated stocks in my portfolio"" or something like that. And the system takes that query, goes into an SQL database, and — one —parses from a natural language utterance into something that's a little bit more structured so it looks like a SQL field, and then — two — can actually go and compute the operation and output and answer. So, ""Google is the stock that has the highest-rated portfolio"" or whatever it was that I said as the potential question. Lukas: Right, right. How did you frame that even as a machine learning problem? How did you get training data? 
Did you view it as an NLP, like a sequence-to-sequence model type thing? Or how did you think about that? Kathryn: Yeah. It's a great question. The person on the team who built it, Yanshuai, would be better equipped to answer it than I, but it was framed that way. Framed as a sequence-to-sequence mapping problem. When we started the application, Transformers hadn't really taken off yet. Midway they had, and it ended up being sort of this, ""How can we adapt Transformers to very small datasets?"" Because we have very small...there's close to no training data mapping natural utterance to extremely structured, pseudo SQL. So we built this, we kind of bootstrapped this pseudo-SQL database. I had a bunch of labelers come in and be like, ""Yes, this is what..."" It was sort of a pick list. It was like, ""If you say this question, does it mean X, Y, or Z?"" They labeled the pick list and we had that as our bootstrap training dataset and decided on the application because there's a lot of SQL databases in the bank and in a lot of large enterprises. Often you've got a handful of folks who are the analysts who are called upon to go and do these queries and find answers. They'll build dashboards, like a Tableau-type dashboard where that's sort of commonly posed questions. They're FAQs where it makes sense to automate. Every month, you see the chart. But our original hypothesis was there's probably lots of long-tail questions that it doesn't make sense to program, but that it would be really nice...but you also don't want to have to call in the data analyst to do the work on. Can we just have people ask those questions to the tools? Lukas: Interesting. Lukas: Switching gears a little bit, I was hoping to hear a little bit about your career and how you came to this really interesting job. You talk about coming up through humanities — although I think you do have a math degree, which is a kind of a technical side of humanities — and then you did grad school in comparative literature, right? Which is a little bit of an interesting switch, although I had a couple friends in college that did math and comp lit, but I was always struck by that. I wonder if you could talk about what you were thinking at the time and how that informs your work today? Kathryn: I'm glad you noticed that I also have math background, because people often are like, ""How does literature and then machine learning...?"" And I'm like, ""Yes, but I did do a lot of work in linear algebra,"" so at least I can imagine functions. It's a great question. I wish I had a master plan, but I didn't have a master plan. I actually intended originally to be a physics and philosophy major. Those were the things that interested me most. I was kind of a klutz in the lab. I really didn't like the lab, so I was like, ""You know what? None of this physics stuff. I'm going to do the part where you don't have to go into the lab and just do math."" I always loved humanities and I spent my junior year abroad in Paris, and I didn't have to take any math courses because I had enough sort of standing credits. So I took courses in philosophy, film, literature, and I really loved it. I decided to change my major my fourth year in college, and instead of just doing math, do a double major in math and comp lit. The good thing about comp lit is that it's kind of...well, the good and bad thing. The bad thing is that it kind of lacks identity as a discipline. 
It's kind of a grab bag of it used to be...imagine you take a theme like ""love"", and then you say, ""How do the French write about it? How do the Germans write about it?"" And you find these sort of cultural role overlaps, which was the comparison. As the discipline has evolved, it's kind of become...some people focus on philosophy and literature, some people do cultural studies, some people do rigorous sort of history of a national literature. The ambiguity was good for somebody like me, because it was like, ""Sure, you want to do math, history of philosophy, history of literature, languages, semiotics? Great! Great place for you."" I went into it. I really liked languages and I thought it provided a lot of freedom to explore. I wrote my dissertation on 17th-century epistemology, basically what was knowledge at the time, and focused on Descartes, Leibniz, Newton. Sort of the old, dead white guys, and- Lukas: Classic math guys. Kathryn: Classic math guys. Exactly. Yeah. I know a lot about 17th-century math that's not really as relevant today. Lukas: Oh. Tell me some stuff about 17th-century math. Kathryn: Favorite things in 17th-century math. It's the dawning of calculus, right? Lukas: Yeah. Kathryn: You've got Newton. Newton in particular is really, really fascinating. Leibniz and Newton, both of them. Leibniz was...he had this thing called ""Cogitationes caecae, ""blind thought"". He really thought that basically we could just let the symbols do all the work and it doesn't matter if we can visually represent some mathematical concept or if it really has a tie to the real world. It was just, ""Let's go calculate stuff."" With that sort of focus on formalism, he did a lot of...he had a lot of development of thinking about infinitesimal ratios and some of the mechanisms that go into making differentiation and integration possible, that just kind of worked. Newton on the flip side, kind of started off more on this formal track, but then he was influenced by a bunch of traditional focus on Greek math that was really prominent in 17-century England. There, they were like, ""You have to visualize. It all comes back to geometry."" Geometry started with farmers out trying to measure distances in a field and it needs to be grounded. He grappled a lot with thinking about the gap between a limit and zero, right? You see that through the Principia. I wrote a paper at one point on his notion of...he called them first and last ratios, which were basically proto-limits. He kind of held himself back because he was really so focused on keeping things tangible, which I found really interesting between the two of them. So, yeah. One 17th-century math tidbit. Lukas: Did you continue this line of research in grad school? Or was it something else? Kathryn: I continued the line of research on 17th-century math and philosophy in grad school, wrote a dissertation that five people have read, on this topic. Then afterwards basically, with comp lit...word of the wise for any listeners who decide to be comp lit grad students, there's not a lot of comp lit departments — there's a lot of national language department — and there's not a lot of availability. Kathryn: I think if I had been able to become a philosophy, history of philosophy, professor, I probably would have stayed an academic. But I was sort of prepared to be a French literature, 18th-century professor, and I was like, ""I don't know if that's really me."" There's not a lot of jobs. 
So it's like, ""Do I go to Nebraska and fight for my assistant professorship? Or do I go into tech?"" I was out at Stanford, so I just decided to switch careers. What is the humanities training, is that still with me, besides having arcane knowledge that not many people want to talk about? But I'm glad you do. Normally it's a liability for me at work because I get feedback on performance reviews that are like, ""Kathryn's really great, but sometimes she goes off on these philosophical digressions and we're not really sure why."" But I think one thing that I've brought with me is...I trained as an intellectual historian in grad school. If you're a philosopher, often today you're evaluating arguments for, ""Is this right? Is this true?"" And then there's people who come in and say, ""Well, there's no such thing as truth in the first place and everything's relative."" I think as an intellectual historian, I didn't care if Descartes was right about the motion of planets and space. I was really interested in understanding what he thought he was thinking. Why this? What was he reading? What was happening around the time? Sort of saying, ""All right. I'm reading this as a 21st-century reader and I'm coming with all of my prejudices and predispositions of thinking like somebody who's on the internet and viewing the world in a certain way and thinks that universal gravitation is second nature,"" but for him it was not. I think there was a lot of training in suspending disbelief, ensuring that one didn't bring in one's own subjective predispositions and really understanding a foreign thinker. I actually think that's really good training for product management. I think it's good training for executive work. You're constantly in situations...like with a customer, it's not, ""Here's how I want to use my cashflow forecasting app."" There's going to be a distribution of millions of customers who are totally different from me. I guess I'm always approaching problems from the perspective of, ""I'm not going to assume that there's one right answer, and I'm not going to assume that this person thinks similar from me or comes from a similar place,"" and I think that's been really good training in doing product work eventually. Lukas: I mean, you didn't just study any comp lit, it's very like, different technical points of view. I feel like you see that in ML too, right? Lukas: I had a boss who always said he preferred to hire biologists over physicists, and I think what he meant by that is he liked people that didn't really try to figure out the underlying structure of models, but just examined them from the outside of what they do, right? Take this kind of open mind, ""We're not going to make assumptions."" But then I think about Newton actually, and it seems to me like...you tell me, actually. It seems like Newton made this leap into a lot of structure. He must have wanted to put an underlying structure on the world really badly to come up with such an amazing structure. Do you think Newton doing ML, it would have driven him nuts that we have this point of view of looking at the models from the outside and just examining what they do and maybe not worrying about exactly how they work and making them more and more complicated? Kathryn: Yeah. That's a great question. I think there probably would be aspects of ML that would have driven Newton crazy. There's other aspects where I think there's some kinship or predecent thinking. 
I'm influenced here by one of my dear friends and mentors, a man named George Smith. He's a professor at Tufts who...if you really want to know about Newton, talk to George. He's the guy. He's taught this course on basically how Newton changed the standards for high-quality evidence for 25 years and really knows a lot on this topic. One of the things I learned from him is that Newton always assumed that the system that he was trying to model was infinitely more complex than the deductive mathematical model that he could apply to it. There's a lot in the Newtonian scientific paradigm that's like, ""All right. We're going to put this hypothesis out there, or this deductive model. Then we're going to make observations and there's going to be a gap between what we observe and what we've modeled. The progress of this paradigm is to continuously watch that gap and close it when possible by refining our mathematical model, but sometimes realize where it's just completely off the mark and we might need to sort of shift our thinking."" To that extent, I think there's some...there's more affinity within sort of the ML mindset than a traditional rules-based computer programming mindset, or even the GOFAI-type mindset, right? As long as we can articulate the structure of the thinking, we can model the world. Lukas: What is a GOFAI-type mindset? Kathryn: ""Good Old Fashioned AI"", expert systems. Lukas: Nice. I didn't know that acronym. Kathryn: It's always been a plight of mine. Definitely spent a lot of my time in comp lit working on rationalist, hyper-structured, 17th-century thinkers and drove my comp lit colleagues crazy because I came from the math background and my papers were proofs versus more exploratory. I envy the ML mindset too, because I think coming from more of that ""always trying to prove things,"" it's not always the best approach to running a company either. Lukas: Do you think Descartes would have had a different point of view on ML? Kathryn: This is another loose analogy, but basically this whole...you know the famous, ""I think therefore I am"", ""Cogito, ergo sum"". He phrased it that way in 1637, ""A Discourse on the Method,"" which is like, ""Here's my method,"" and then he runs it through three examples, one of which being the geometry. One was on- Lukas: Sorry. Could you explain what that means? I've heard that a zillion times, but I don't think I know the implication of ""I think, therefore I am"". Kathryn: When he first stated it, what he was trying to do was big, bold 17th-century work. Prove that God exists, A, but then, B, put forth a new way of thinking and doing science that was cleaner and upon which one could actually feel like they had sort of...they could believe these statements and propositions of truths versus the predecessors, which were always citing the ancients. It's like, ""Why is something true? It's true because Aristotle said it was true,"" versus ""It's true because I have used logic to come to a propositional type of truth."" When he was starting his ""Let's prove that God exists,"" he says, ""Well, where do I start? Why don't I start by proving that there's some clear point that I can stand upon where I know this is what truth looks like."" And so the ""Cogito, ergo sum,"" was that point where basically he's like, ""No matter how hard I try, if I try to pretend I don't exist, there's got to be somebody there doing that thinking, therefore I must exist."" It's kind of this proof by contradiction. There's got to be some voice there. 
What's interesting is he rewrote this. In his second attempt at it, he got rid of the thinking. So he didn't say ""cogito"". He just said, ""I am. I exist."" And then he said, ""If you want to understand how this truth works,"" he didn't use these words, but my paraphrase, ""Go sit in a room and meditate for days."" Do it, repeat it, and do this for 30 days, and eventually you will have trained your mind to think clearly. I looked at that and I was like, ""Well, that's different. That's not quite what I thought Descartes was about."" Go sit in a room and repeat things until you train your mind to think that way? That was really interesting to me. This is a loose analogy, but I think there's something similar to supervised learning, when it's like, ""Is this a dog? Is this a cat?"" It's just like, ""Show me 50 examples"" and repeat until you've established the input, output pattern. That's not really there, but I think it's kind of there, and I think it's interesting that there's sort of this intellectual foundation for supervised learning in Descartes. Lukas: Although it seems like with Descartes, there's maybe no input if you're meditating? Kathryn: There's no input besides your own training your mind to rewire. It's like, ""Rewire your mind to think this way."" Lukas: I see, I see. Interesting. Lukas: Do you then have thoughts on AI being sentient? Do you have opinions on things like the Turing test? Or...what are those classic ones that you learn in your first AI class on the room with the person in it, and the book's doing Chinese or something? Kathryn: Yeah, the Chinese...the John Searle Chinese room argument? To be honest, not really. I find the Turing test interesting conceptually, but I struggle with the arguments that are this sort of singularity type arguments, like computation is rising and the models are more complex and these models are going to get to the point where they come into consciousness. I just don't really see it. Do you? I don't know. Lukas: Well, we have this existence proof with humans, right? That seems like consciousness kind of comes from some type of process. It seems to me like unless you think there's something really like God in there, in the physics somehow, there must...it sort of must come from increasingly complicated computation, right? Kathryn: Yeah, for sure. I sort of fall into the materialist. While I've spent a lot of time with the 17th-century philosophers, I don't share the sense of the soul and there's a God in there that makes things different. Fair enough, fair enough. But it's interesting. There's still that gap between...I don't know if it's the plasticity of our neurons, the fact that there's just thousands of billions, trillions of very plastic processes going on in there. Or if it's like a Dan... I don't know if you know Dan Dennett-type argument where the self is basically a user illusion, right? In the same way that we interface with...I'm looking with you on a Zoom screen right now. My iOS operating system which makes it easy for me to engage with the computer versus seeing the nitty gritty insides, maybe it is a useful illusion that we've gained through evolution, but that it's not really real. I kind of buy that argument. Basically consciousness is a red herring, is kind of what that argument would be. Lukas: Are you the kind of person that would get into a transporter that would disintegrate your body and reassemble it somewhere else? Would you feel like that's a safe thing to do? 
There's some very strong opinions on different sides of that at Weights & Biases. Kathryn: That's a good question. I've never really thought about that. I might change now that I have a son, now that I'm a mom. If you met me a year ago, maybe I'd say, ""Yeah, sure."" But now it's almost like, ""Would that mean that there's some implication on my relationship to my child?"" I'm not sure I want that. Lukas: That's funny. I have a small daughter too, but I think I've...my whole life, I would just never get in that machine. To me, it seems incredibly unsafe. I don't know if I could justify it, but I just would not do that. Lukas: All right. We always end with two questions that are a little more on the practical side, but this is really fun. So one is, I guess what's a topic in ML that you think is understudied or underappreciated? Kathryn: I don't know if it's understudied, but I think it has been underappreciated and now is becoming more appreciated, but this is causal inference. The Judea Pearl-type work. This is something we're starting to really look into at the bank because of this need for sort of more interpretable models. And lots of conditional probabilities where if we could understand what happens in one variable and how that relates to another variable, it would be really useful for sort of macroeconomic modeling. It's a topic that a person like me is going to be interested in because it's philosopher candy as well. There's lots of interdisciplinary approaches to this problem. What is a cause? If we can even really define it well, and it's represented formally in machine learning in one particular way. I think it's going to be interesting over the next couple of years to see these sort of traditional causal inference methods interacting with deep learning and the deep learning community, so that's one thing that we're... I'm personally excited about, but also Borealis is looking into these days. Lukas: Interesting. Super cool. Is there a paper that you could point people to if they were interested in learning more about? Kathryn: Yeah, yeah. I'll send a link after. There's a paper recently. I know Elias Bareinboim, who is one of...he was one of Pearl's students and he's at Columbia, he was a co-author. Yoshua Bengio was a co-author, and then there's two others that I know as well. That's all about sort of deep learning and causal inference. That's probably a great place to start. Lukas: Cool. The final question we always ask — and you've seen so many different applications, I think you'll have a really interesting perspective on this — is basically kind of going from wanting to build a model for some purpose and kind of getting it deployed in production and actually doing that purpose. Where do you feel like there's the biggest painful... Or most painful bottlenecks? Kathryn: Yeah. There's often a lot. It's hard to say the most painful. I think at the highest level, it's really deep integration into the full business process. This is really coming also from an ""enterprise ML perspective"" versus a sort of ""ML for software"" company. I've seen tons of projects fail where you might have a good...given a task, build a model. If it's just handed over to the business without considerations of, ""All right. Where does the production data sit? How do we get that data from that environment to the environment where our model sits to do inference?"" There's always questions on just the timeframe, or is this batch monthly, weekly, real-time? 
I've seen stuff where there is...we think we can easily do a batched output, it's just monthly. Output is set up for predictions that are going to go into some call center list. But there's some nuance in the process where the third week of every month, they do this to the data and that's going to mess this up, and so it's always in the details of what that full flow will look like. Then the third, with the business process is, ""All right. Now you've output the prediction, but how does the process change?"" If people just use it and they continue to do what they're doing, I don't think you're really taking advantage of, ""Now that we have this, we can shift our approach."" Let's say it's a call center automation thing. We can shift the number of people we have on staff at a given time. We can collect the following new data to improve the process in some way. I think you have to think about it holistically, right? In terms of, ""What's the end result? Where does it sit? How do you measure it?"" That's kind of all of the production ML pipeline, but I actually think it's all there and it all matters, so. Lukas: Spoken like someone who's done a bunch of production pipelines, I think. Thank you. Thank you very much. That was really fun. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.",8514 +Sean and Greg — Biology and ML for Drug Discovery,https://www.youtube.com/watch?v=-CVJZQa-lvc,3325,2021-12-02,"Greg: Evolution is one of the most interesting aspects of informational science because it's the ultimate bootstrap system. You've got these letters strung together on DNA that have, over billions of years, encoded themselves into the most sophisticated system on the planet, and it's everywhere around us. In theory, artificial intelligence could look at that and understand every piece of it the same way that every cell does. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Today, I am talking to Greg Hannum, the VP of AI Research at Absci, and Sean McClain, the founder and CEO of Absci. I'm talking with them about drug discovery and development and manufacturing and how ML fits into that, and that's what Absci does. This is a super interesting conversation that I really enjoyed. Lukas: Why don't we start with you, Sean? Maybe you could explain to our audience what Absci does. This might be like explaining it to your mother or something, right? Everyone's sort of interested in these applications, but maybe doesn't really understand the deep biology or really even the industry that you're in. How do you think about that? Sean: Yeah, it's pretty simple. We are merging biology and AI together. One of the really exciting aspects of our technology is that we are able to screen or look at billions of different drug candidates, looking at the functionality of those drugs as well as the manufacturabilty. That's compared to what the industry is currently doing, is looking at drug candidates in the tens of thousands. If you look at a protein-based sequence like a monoclonal antibody — you're all familiar with COVID, Lilly's antibody that came out, that's a protein — and if you look at a protein sequence, there is more sequence variance in an antibody than there are atoms in the universe. 
What we're essentially doing is feeding in all these billions of different data points on the protein functionality and manufacturability to ultimately be able to predict the best drug candidate for a particular disease or indication. Essentially our vision is to become the Google index search of drug discovery and biomanufacturing where we can take patient samples, find the specific biomarker or target for that particular disease, and then utilize deep learning and AI to predict the best drug candidate for that particular target or biomarker. All at the click of a button, and totally changing the paradigm of healthcare and biotech, and ultimately getting the absolute best drug candidates to patients at truly unprecedented speeds. It's this really exciting forefront of, again, merging biology and AI together. Lukas: Do you ultimately take these drugs to market and sell them? How far do you go in this process? Do you just invent them and then hand them off? How does that work? Sean: It's really a perfect marriage of what we do and what pharma does. Pharma's really good at being able to design clinical trials, take the drugs through the clinical trials, and then ultimately market them. Where we come in is being able to assist the pharma and biopharma companies with actually designing and creating the drug itself. Then we out-license it to the large pharma to take through the clinical trials as well as commercializing it. We get milestones and royalties on that, which essentially, in the world of tech, is another version of a SaaS model, but based on the clinical trials and ultimately the approval of the drug product. Lukas: How far along is this? What's the drug where you've used these techniques that's closest to something that cures a disease? Sean: Yeah, so we have one product that we're working on right now that is in Phase III. They are planning on implementing our technology post-BLA approval. We're potentially assuming the drug gets approved. A few years away from actually seeing that drug on the market. So that would be our first drug candidate that would make it to the market utilizing our technology. Lukas: What does it do? Sean: Unfortunately due to confidentiality, I can't disclose that, but I'm hoping here in the very near future that we will be able to disclose that. I will say in general, most of the programs that we work on are either on immuno-oncology or in infectious diseases. But our platform's really agnostic to the types of indications or diseases that we can go after, but we really focus on where the industry's focused, and a lot of that is on oncology. Lukas: Is that because cancer's such a big deal and so many people get it or some other reason? Sean: Yeah, I would say that that is one of the big diseases that the industry is focused on and where a lot of innovation can be. Our technology is really an enabling technology, so we take the ideas that our pharma partners have, they're the experts on the biology, and saying, ""Hey, we need to design a drug that has these attributes that can do this."" We can then enable them to do that and that's across really all diseases and indications. Lukas: Forgive me for such basic questions, but I'm really curious how this works. So a pharma company would come to you and say... Is it as simple as, ""We want to cure this specific disease and we need a molecule that cures this disease?"" Do I have that right? I mean, how does that happen? Then what do you deliver? 
Is it like, ""Here's a molecule,"" or ""Here's 20 you should try,"" or ""Here's how we think about it?"" Sean: Yeah, I mean, the simplest way of looking at it, it's exactly how you described it. So they come to us and say, ""Hey, we have this particular target or indication and this is the biology. If we design a drug that has these attributes, we think that this drug candidate then could kill this cancer cell."" They then have to perform the animal models and then ultimately take it into the clinic to prove their hypothesis on that, and we're assisting them in being able to discover the drug candidate that has the properties that are needed to solve the biology problem that they have determined is going to ultimately cure or improve that particular disease. Lukas: When you say drug candidate, is that literally a molecule? Sean: That is. In our case, that is a protein that is being used as a drug. There's protein-based drugs and then there are small molecule-based drugs. So small molecule drugs, Advil, Vicodin. Basically a pill in a bottle. Then you have the protein-based drugs or biologics, such as insulin and a lot of the exciting monoclonal antibodies. Again, going back to Lilly's COVID antibody or GENERON's COVID antibody, these are all protein-based drugs. The interesting thing with protein-based drugs is you can't chemically synthesize it. You actually have to make it in a living organism. That adds more complexity to discovering these molecules as well as manufacturing them. Lukas: Can you predict exactly what the protein's going to look like and then look at it and see if it does it? Is that all in simulation or are there surprises when you actually try to manufacture it? Sean: Yeah, so there is a lot of surprises that can occur. We are not to the point where we can predict drug functionality. That's ultimately where we're headed with all of this. A lot of times, if you can predict the functionality of a protein, that doesn't necessarily mean that you can manufacture it. So many times we see with large pharma, they discover these really exciting novel breakthrough protein therapies, but ultimately can't take them to the clinic because they can't manufacture them. You not only have to predict the protein functionality, but you also have to be able to predict the manufacturabilty of it as well. We're really looking at both of those. Really what AlphaFold has done with being able to predict the protein structure based off of the amino acid sequence, where we're headed is being able to predict the protein function or protein-protein interaction. So it's the other side of the coin. It was a huge breakthrough for AlphaFold for basic research. What we're doing is going to be a huge breakthrough in drug discovery and biomanufacturing. Again, that's the opposite side of the coin from what AlphaFold has done. Lukas: I want to make sure I heard you right. Did you say you're not predicting the functionality? Sean: We are predicting the protein functionality. Lukas: The functionality is how it interacts with another protein? Sean: Exactly. It's ""How tight does it bind to another protein?"" Then also, we take into consideration immunogenicity. Is it going to react in the body once it's administered? Then also taking a look at the CMC or manufacturing aspects. Is it soluble and stable? Can it be produced at high yields? These are other predictions that we take into account or other attributes we take into account. Lukas: Interesting. 
I want to hear more about how this actually works, but, I guess, one question I want to make sure that I asked you is that I saw that you started your company in, I think, 2011, right? It seems like ML as applied to medicine has changed so much. I'm curious if you started your company with this perspective or how different it was, and also how your perspective on machine learning has changed as machine learning has evolved and deep learning's come along. Sean: We did not start off as an AI company. I would say we are very similar to Tesla's evolution. Tesla started off as an electric car manufacturer. They started collecting all this data from their sensors, built an AI team around that, and now they're a fully autonomous self-driving car tech company. That's a very similar evolution that Absci is on. We started out on the biology side and engineering E. coli to be more mammalian-like to really shorten the development times and decrease manufacturing costs. We then built out this technology that allowed us to screen billions of different E. coli cells and look at different variants of proteins, looking at basically the drug functionality and then also looking at, ""Can you actually manufacture this?"" We started generating all this data, billions of different data points on the protein functionality and the manufacturability. We knew that if we could leverage that data with deep learning, we could get to the point where we could predict the protein functionality needed for every type of target or indication, and that's ultimately what led us to apply our Denovium pioneering deep learning technology for protein engineering. But it really started off with the data. Data is so key and we have proprietary data that no one else has that we are then leveraging deep learning to mine that, to get us to the point where we can ultimately predict protein functionality. Where we're currently at right now is being able to leverage the data we already have and be able to predict the best billion-member libraries we should be screening for, for every new target and indication we work on. Eventually, as we train the model with more and more of our proprietary data, the more and more predictive it's going to get. Instead of predicting a billion-member library, it starts predicting a million, a thousand, and then ultimately predicting the absolute best drug candidate for a given target or indication, looking at what modality should it be, the affinity, low immunogenicity, all the manufacturing attributes that you want. Right now, it's a race to feed as much data as we possibly can, but it all started off with the biology technology that we had originally developed. Lukas: For you, Sean, as CEO of a company that's not a deep learning company, I'm curious how you first got exposed to deep learning and what made you think that it might be useful, and then how you got conviction around making these large investments in deep learning that you're doing now. What were you seeing that made you feel like it would work? It seems like you're more bullish on it than maybe a lot of your peers and I wonder where that might be coming from. Sean: I'm bullish because we have the data. Again, it all goes back to data. We have high-quality data on the protein functionality and manufacturability. It goes back to an earlier point that I made, which was there are more sequence variants in an antibody than there are atoms in the universe. 
There's no screening technology that we could ever create that would allow us to mine that big of a space. That's really where the deep learning comes into play: being able to essentially sift through all of the potential evolutionary paths by which a drug could be created and figure out what that best drug candidate is, basically mine that whole search space, and ultimately come to the point where we're creating the best drugs for patients. Once we implemented the deep learning technology, we already saw huge gains in terms of yields and the types of drugs that can be discovered when taking our data and pairing it up with deep learning. Ultimately where I see us going is becoming a full tech company once we have enough data here. I'm extremely bullish on AI and what it can do within healthcare. Lukas: It's interesting talking to you in that we work with, I guess, a lot of pharma companies, which I see are slightly different in what they do than you, but it seems like their perspective is ""interested in deep learning, but probably not at the CEO level,"" except in the sense that they're making, I'd say, small or medium investments whereas you want to transform your entire company in this direction. Do you think that you're doing something different than your competitors around deep learning? Do you think that you can be the best at this in some way? Sean: I do think that we can be the best. I would say that the industry is starting to understand the benefits of what deep learning and ML can provide. Biotech probably doesn't have as great an appreciation for tech and machine learning and what that really means, and vice versa, the tech industry doesn't quite understand all that goes into biology. It's really exciting to be able to take two industries, two cultures, and merge them together to really create something that's going to be hugely impactful for patients and ultimately the world. Lukas: That's super cool. I mean, thanks for doing an interview like this. I think this is really great for cross-pollinating ideas. I love these. I have a lot of maybe slightly more technical questions. Greg, feel free to jump in if you like. Lukas: One thing I wonder about with ML applied to this stuff is, do you feel like it was always a latent possibility to successfully be able to make these predictions that you're doing now and it was just a matter of getting enough data? Or do you feel like there have been breakthroughs in machine learning, in model architectures or something like that, that have actually made this a more practical application? Greg: Yeah, thank you. It's a great question. I would say it's a little bit of both. There has always been potential for ML in bio, and it has been very successful in the past in some of these same indications, but it's been limited both on the data collection side and on the AI modeling side. The data collection side is not stagnant; it's moving in incredible ways, the same way that the AI community has. And on the modeling side, recent advances in large-scale architectures, transformers, and a lot of different techniques for getting these models to converge successfully and to be very predictive have been incredible breakthroughs as well. Essentially, now I'm less concerned about the AI holding back any sort of success than I am about making sure that we can marry these two communities, make sure that what is always an intrinsically messy process of collecting biological data is actually connected to the inputs and outputs of that AI.
Which, as Sean will be the first to tell you, this is a great place to be able to do that, because a lot of that hard work of actually developing these assays and working through that challenging space is part of the bread and butter of Absci. Lukas: Could you give me maybe a concrete example of an ML breakthrough that would help with this? For example, I think of transformers as... I know them as technology mostly for natural language processing. I could sort of imagine how this might apply to what you're doing, but maybe could you walk me through some kind of architecture, some kind of new way of doing things, and how you framed the biology in this machine learning world? Greg: I'll give a couple of examples that have come over the last few years. The biggest is related to scaling. The biological problems are necessarily complex. Evolution is one of the most interesting aspects of informational science because it's the ultimate bootstrap system. You've got these letters strung together on DNA that have, over billions of years, encoded themselves into the most sophisticated system on the planet. It's everywhere around us. In theory, an artificial intelligence could look at that and understand every piece of it the same way that every cell does. What you need to do to connect these dots now is collect enough data on the different parts of the system. Namely, you need a lot of nucleotide data, so we need to do DNA sequencing. But we need that from lots of different organisms, and we need to understand how they translate into proteins, how those proteins act and function, whether they bind together, how they fold together. It's an incredible number of pieces that need to come together to see that big picture. This is where scale becomes very important. It's a bigger problem than some traditional ML or even the original deep learning architectures are capable of solving, because it simply requires more parameters, requires more complexity, requires better understanding. NLP-based models and transformers in general are really good for this domain because a lot of what we operate on is in sequence space. But I wouldn't say that they're the only approach to this either. But those advancements, letting us get to larger and larger models to create the GPT-3 of DNA, are what really give us, for the first time, a real handle on these challenges. Lukas: There is this trend in NLP — which I'm much more familiar with — of models becoming more and more black boxes. Less and less informed maybe by linguists. I don't know if every linguist I've had on this podcast would agree with that, but I think broadly as the data increases and the model complexity increases, they become more opaque. Is there a similar trend in these applications, where maybe the chemistry and physics matter less and you just treat it as this translation from letters to ""Did the drug get successfully produced or not?"" or do you still inject your knowledge of biology or chemistry or physics to make the whole system work? Greg: Yeah, it's been moving in that direction, but we're not there yet. Biology is...those two communities still haven't fully been united.
There have been some big advancements recently in the protein-biology space, and the MSA Transformer is a big example of this: something that bioinformaticians and computational biologists have been doing for years, aligning sequences to see what kinds of patterns they share in nature, can be used directly as an input, with a special kind of architecture that lets models learn from that. These sorts of biologically inspired architectures are still coming. AlphaFold is another great example, where they used a number of relatively novel techniques, and combining them together was really key to the success. The black box approach is powerful and I wouldn't downplay it, but there's still plenty of room for improvement. Sean: But I think that's ultimately where we want this to go. You can input a target sequence and have the output be the sequence for the drug candidate, and predict all the binding just based off the sequence itself. We've already seen some really interesting discoveries. Our deep learning model showed that we got an increase in overall yields from a protein that wasn't necessarily classified as a chaperone, but that our model predicted would be one. I think these are some of the really interesting discoveries that are going to be occurring at a very rapid pace by bringing the AI and biology together. Lukas: Sean, how do you think about investing in data collection versus your ML team? There are maybe two ways to improve your models: going out and collecting more data, which is really one type of investment, versus building up ML expertise. Do you think about it that way and do you feel like there's a trade-off there? How do you look at that? Sean: I think investment in both is absolutely critical. You can't invest in one and neglect the other. You really have to make the strong investments in both. Right now, a big investment of ours is, ""What is all the data that we want to be feeding into the models?"" Looking out 10 years, are we going to regret not collecting this piece of data? Then how do we build our databases and scale the amount of data that's needed in the future? How do we collect it as quickly as we possibly can to then hand it over to our ML team to be able to continue to train and improve the models? We have made huge investments in both, from the wet lab side, the data capture, and the database and scaling that along with the AI team. Lukas: As more of a computer scientist, I'm definitely enamored with the idea of a wet lab. Could you describe what happens and what that collection process looks like? Sean: We just built out a campus; I think it was 88,000 square feet. Half of the campus is office space and then the other half is an actual lab. The lab is super key to what we do. It ranges all the way from the drug discovery team all the way down to our fermentation and purification team that grow up the cells and ultimately purify them. A lot of the data that we're feeding into our deep learning models is Next Generation Sequencing data and flow cytometry data. That's really key. Some of the breakthroughs within NGS and the speed at which we can process NGS data are really enabling us to do what we do. It's really fun to be able to grow a team that's both on the wet lab side and then the AI and ML side.
Also, I would say an AI scientist who understands the biology is absolutely critical to what we do, and the talent on that side is...there is not a lot of it out there, but we have done a really amazing job of building out talent that understands both aspects. Lukas: Maybe this is a stupid question, but what goes on in a wet lab these days? Is it like beakers full of proteins? Is it microfluidics arrays? I don't know. How does it work? How fast can you actually collect meaningful data? Sean: We build these...so we start off with building these large libraries. We work with what's called a plasmid. It's basically circular DNA that encodes the drug product. We vary that DNA to look at various different drug candidates. In a single small test tube, we basically take all of those billions of different plasmids and put them into E. coli. It's extremely small, and you look at it and you're like, ""Wow, there are trillions of cells in there,"" and it's pretty incredible. Then we take all of that, we screen it, and then ultimately we find the drug candidate and the cell line. Then we grow it up in big fermentation reactors. Think of beer and brewing beer. It's essentially big vats that are highly controlled and then you just grow up the bugs in there and basically give them the genetic code to make the drug candidate and then you scale it up from there. But yeah, it's all beakers, fermentation, purification. You name it, we've got it. Greg: I'd add a little color to that as well, in that coming from the background of somebody who doesn't spend every day inside the wet lab, it feels a lot like stepping into Wonka-land. You have an amazing amount of human ingenuity sitting on every desk, whether it's a mass spectrometer or some sequencing technology or...all these devices have very specific and very incredible capabilities and a bunch of people who know what to do with them and know how to put all the pieces together to make this stuff happen. Sean: It's so funny. I don't think I've ever had anybody ask me, ""What does a wet lab do?"" I was searching for the words to describe it. I probably did a terrible job. But it's like- Lukas: I thought it was great, what you provided. Sean: You don't really quite understand the magnitude until you step in and really understand every intricate aspect that's being done. Lukas: I remember the first time I ever went into one of our customer's wet labs. I felt like, ""Oh, this is what I thought science was like when I was a kid."" I love it. Greg: I'm still disappointed I don't get to show up in a lab coat. I might just start doing that now. Sean: Yeah. Lukas: It's funny. I never thought about this, but we do a lot of ML experiment tracking, but I would imagine there are a lot of parallels to tracking all the experiments that you're doing in the lab. Do you have software that does that? You've probably written a lot of software to just keep track of everything that's happening in there, right? Sean: We've actually decided to build a lot of this out ourselves, and Jonathan Eads, who's our VP of Data Science, and his team are actually working on building out a database where we track everything internally, based off of the software that they have developed. This is really because there was no software solution out there that met our needs. We actually just got a demo of it the other day and it's really incredible, what it's going to allow us to do.
Not only in the data capture, but also in being able to track where programs are in the lab and where we have bottlenecks. I mean, it's really this brilliant software that is going to help expedite what we currently do and capture the data that's needed for long-term success. Lukas: Very cool. I'm curious about how you think about where this goes. Where do you imagine ML taking you as you collect more data? Do you think the whole process moves to this? Do you think you could run clinical trials essentially in ML and know if they're going to be successful or not? Sean: I won't say that we'll be able to run ML for clinical trials, but the drugs that we do design, if indeed we are predicting the best drug candidates for various indications, it's going to increase the overall success rate. That in turn is going to lead to shorter clinical trial timelines and being able to rapidly progress new drug candidates through, and ultimately lead to the point where we can do personalized medicine because we have shown that the success rates dramatically increase and allow for that personalized medicine. But who knows? In the future, we could be able to use ML for clinical trial design and prediction as well. One of our core values here is believing in the impossible, so I feel bad for not saying, ""Yes, ML will be able to predict clinical trials and not actually have to go through it."" It'll be really interesting to see what's done on that front in the future. Lukas: What is a typical clinical trial success rate? Sean: Right now, it's right around 4%. Lukas: 4%. Sean: Yeah. Lukas: But there are different stages, right? Or how does that work? Sean: Yeah. There are three stages. You have your Phase I, your Phase II, Phase III, and then ultimately approval. So going from Phase I all the way through approval, it's about a 4% success rate. Lukas: Wow. Sean: Yeah. Lukas: Just as another CEO, it sounds totally harrowing to me to have my revenue depend on a 4% success rate process. How do you stay sane in a market like that? Sean: The way we structure our revenue is, one, the pharma partner pays us to actually develop the drug candidate and the cell line. We're getting paid for that. Then we get paid on milestone payments as they progress through the clinical trial. You get a milestone payment at Phase I, Phase II, Phase III, ultimately approval, and then royalties. Sean: Even if a drug doesn't make it to the clinic, you can still get paid these milestone payments, which are 100% pure margin. Then it's a law of large numbers. It's just growing the number of programs you have as quickly as you can. You ultimately get to the point where you do get drugs approved and you get royalties coming in for 10 to 15 years off of that. But you grow the revenue base just by growing the number of programs every year. Lukas: Can you say order of magnitude how many of these you're doing? Is it like thousands? Sean: We currently have nine active programs ongoing. Our goal for this year is five programs, which we're on track for, and then increasing those year over year. But no, it's definitely not thousands. It's more in the tens than thousands. Lukas: Do the programs inform each other? Is this similar to natural language where you can have one big model and then fine-tune it on the different cases? Greg: Yeah. That's actually a big part of why we think this is so exciting, is because it really is one physical system underlying a lot of these drugs.
Creating a model that can understand this for one drug is useful. Then for the second one, it presumably will need less training data because it can transfer learn what it understands about the first one. Then you go to the third and the fourth, and before long, as Sean was saying, the number of shots you need on goal becomes reduced to the point where any novel drug then becomes a one-shot learning problem. This is exactly where we see it going. Lukas: Is it possible for you guys to engage with the academic community at all? I feel like you're actually adjacent to two very different academic cultures, right? There's the ML culture, which I know well but seems like it might be tricky to share data with, and then the vast medical literature, which I know less well. Are these communities relevant to you at all? Do you try to do any publishing or engage in some way? Sean: Yeah, definitely. We love to engage with the academic community and we are looking to publish some papers here in the near future, both on the work that we're doing, but also in collaboration with some of the leading new academic professors in our area. We see this as a way to continue to validate the work that we're doing and improve the science that we have and leverage domain expertise that we don't have. The academic community for us is really essential to the work that we do. We very much foster those partnerships and collaborations. Lukas: Cool. Well, I know a lot of ML practitioners that I think would be interested in working in your domain. Can you say anything about what you look for in hiring an ML practitioner that might be different than, I don't know, a Google or an OpenAI? Greg: I can speak to some of what we've looked for on our team and what we continue to look for going forward. There are a lot of strengths that naturally come from the AI community that we like to keep going forward. The way that we think about problems, how we understand the implementation details. As you know, AI can be tricky to execute on both the compute and the setup and understanding all the different systems and software that goes into that. But on the totally different side, you have all the biological complexity and it's an entirely different field to be learning...you need a whole other degree to learn about all the complexities that come from that. Lab scientists and the close relationship with them is an important piece there. I guess what I'm trying to get at is that it's that capability to learn, because there are so few people who are naturally in both spaces anyway. So it's a capability to learn, the patience and the rigor to go through and understand all sides of the problem, and how to make an impact therein. It's never as easy as a lot of AI problems often are where it's like, ""Here are your inputs, here are your outputs. Now, maximize some scoring function."" It's a lot trickier than that. The scientists live that day to day. To some extent, it's like, ""Well, welcome to our world."" And that's great because it means that when...we can also say, ""This is how AI can address these challenges. It can help clean up that noise.
We can help better understand what's going on with this process, and then, yes, ultimately build systems that speed up and maybe even replace a lot of these processes."" Lukas: Sean, I guess in that vein, as you have transitioned from not doing a lot of machine learning to really making this heavy investment in machine learning and building out these teams, have there been any kind of unexpected cultural issues or team issues that you've had to work through that might have happened because of adding all these ML nerds? Sean: Yeah, I think it's having everyone recognize that by combining ML with biology and the lab scientists, we ultimately get to our vision quicker and impact patients' lives in ways that we couldn't without bringing them together. I think the first thought is, ""Oh, my gosh, Sean, you're bringing in all these AI and ML experts. Are they just going to automate my job away and they're going to be able to predict everything and there is going to be no need for me?"" It's like, ""Absolutely not."" Biology is so complex. We have so many problems to solve. Once we solve one problem with AI and we have the data, we then need the biology and wet lab expertise to solve the next problem and the next problem after that. It's never going to go away. You need both. At the end of the day, you can't stop the wet lab and the biology side because that's what feeds the data and both are absolutely critically important. I just love the different perspectives that both sides bring to the table to make our company the best it possibly can be. Lukas: It sounds like a lot of fun. Have you gotten any questions from your ML team where you're just like, ""Man, we're just miles apart here,"" like you just don't understand what we're doing? Sean: No, I think honestly everyone has really done a great job of understanding the other side's perspective. Sometimes the AI team may not be getting data as quickly as they would like, but then they dive in with the scientists and they're like, ""Oh, I understand you ran into this problem. Can we work together to increase the throughput?"" Or it's like, ""Hey, I gave you all this data. I'm not seeing any improvements yet. When are we going to start seeing improvements from our AI models?"" I think it creates patience and collaboration and a respect for the part each side plays in the overall bigger picture. Lukas: Greg, do you agree with this? Should I ask you separately? Greg: No, no. I think you nailed it. You started by saying it's exciting and I couldn't agree more. It's an opportunity of a lifetime to be at the intersection of something like this. It's wonderful to see such smart people and such talented people who are respected in their own fields coming together. There's something very humbling always to be on the other side of things and realizing, ""Wow, there's always more to learn."" It's very healthy, as Sean said. It does give you a greater sense of context and perspective. Lukas: We always end with two questions, and I think you both are coming from super different perspectives, but I'd love to hear both of your answers to this. One question we always end with is: what's a topic in ML that you feel is underrated relative to its impact? I mean this very broadly. I mean, I guess, Sean, what skills do you feel like people should be showing up with that they're not, maybe? Sean: When folks come to Absci, we're solving very big complex problems.
Our mantra and our number one value is there for a reason, which is believe in the impossible. We are always looking for people who want to push the limits on both the AI side and the biology side and really bring them together. We are creating this new ecosystem that really hasn't existed and this understanding of what ML can do for biology and vice versa. We just want to bring in people who want to think about things differently and change paradigms. I'm super excited about where the future lies with AI and biology together and we're really on the forefront of that. Yeah, couldn't be more excited about where the industry's headed. Greg: All right. Yeah, I guess I'll give my different take here on what's the underappreciated side of ML. Something that definitely has some appreciation, but could have more, is the capability of deep learning and artificial intelligence to do integrative work. We see an awful lot of research solving specific problems, often hard problems, and they compete against each other on performance scores and evaluation. But the real value, I think, in the practical world for AI is how well it ties different kinds of information together. We use this at Absci in trying to collect dozens of different kinds of assays and we can understand, ""All right, in context for just one of them, this is a spreadsheet of data. It's not even that large. But maybe if I relate that to the embedding space projection of a different model that was trained on a different task, it can tell me something useful about the problem that I'm working on now."" This is a philosophy that we're big proponents of: integrating large multitask systems that can leverage the commonalities in the data by putting it all together. The advantage is not just that you get to use all the data you have on hand; it also creates a simplicity to everything, where instead of having to run all these different pieces, you can ask from maybe one piece of data what the other pieces would look like. Take bioinformatics, for example: we have a lot of computational tools for understanding protein function. You can run dozens of these different tools and try to get them all to work together and set up your environments, or you can have one AI model that knows these answers and can give them to you in a millisecond. How well it can simplify problems and bring different kinds of problems together is something that I think could use more appreciation. Lukas: This really works for you? I mean, I feel like a lot of people talk about this multitask learning and combining problems, but it's always felt a little theoretical to me. Do you actually find that it meaningfully helps on tasks to incorporate data from other tasks? Greg: Oh, absolutely. This was a big part of what we did at Denovium: taking our DNA models and protein models and tying them together, two entirely different domains of data. It allowed users to essentially take a DNA sequence and, with just one artificial intelligence model, find all of the proteins, what they do, and characterize them with 700,000 different labels. Very multitask. We had something like 25-odd different databases that were all tied together; it essentially had to multitask quite a bit to solve those challenges.
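As a rough sketch of the pattern Greg describes, and purely for illustration, the snippet below reuses the embedding of a pretrained model as extra features for a small assay dataset. The pretrained_embed function is a made-up stand-in for a real protein or DNA model, and the sequences and assay values are invented.

```python
# Illustrative only: combine a pretrained model's embedding with a small assay dataset.
import numpy as np
from sklearn.linear_model import Ridge

def pretrained_embed(sequence: str, dim: int = 32) -> np.ndarray:
    # Stand-in for a real pretrained protein/DNA model; returns a deterministic fake embedding.
    rng = np.random.default_rng(sum(ord(c) for c in sequence))
    return rng.normal(size=dim)

def featurize(sequence: str) -> np.ndarray:
    # Concatenate the borrowed embedding with a trivial hand-made feature (sequence length).
    return np.concatenate([pretrained_embed(sequence), [len(sequence)]])

def fit_assay_model(sequences: list[str], assay_values: np.ndarray) -> Ridge:
    features = np.stack([featurize(s) for s in sequences])
    return Ridge(alpha=1.0).fit(features, assay_values)

# Usage with a tiny invented dataset: the assay readout is related to structure
# the pretrained model already learned on a different task.
sequences = ['MKTAYIAKQR', 'MAVFLWWLIV', 'MQEDKRNSTG']
assay = np.array([0.7, 0.2, 0.5])
model = fit_assay_model(sequences, assay)
print(model.predict(featurize('MKTAYIAKQP').reshape(1, -1)))
```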
But it both worked and it really sped up the progress of what we could do with it, as well as allowed some really unconventional approaches. So Sean was talking earlier about the chaperone discovery work where we could use these protein models to understand what a protein would do if it otherwise hadn't been understood by science. These sorts of models, because they're generalized over so many different kinds of tasks, aren't burdened with memorization, and they can say, ""Oh, yeah. Well, hey, look, this looks an awful lot like this. It should do this,"" and we can trust them to step outside their box. Lukas: Is there any paper or something that you could point people to who want to learn more about this? Have you been able to publish any of this work? Greg: There is some legacy work that was somewhat of a precursor to it. We can pull up the paper later. Lukas: That'd be awesome. Greg: Yeah. Lukas: Cool. We'll put it in the notes. Lukas: Our final question, and Sean, I'm really curious to get your take on this one. Sean, you've been super positive about the promise here, but you guys are actually doing ML and trying to get real results, and so I'm sure that you're running into problems. What has been the biggest unexpected problem trying to go from this idea of something you want to do to actually making it really work in reality? Sean: Oh, man, there are problems every which way. I would say first, it's actually convincing the scientific community and our partners that deep learning and AI is the future and showing them the work and showing that this can actually happen. That's the first hurdle. Then I would say, the other biggest hurdle and challenge that we've had to work through is being able to develop the technologies that get us the data — get us the data in a clean format — and then scaling that data and then building out a world-class AI team. Greg and Ariel and myself, with Matthew, are always looking for the best talent and how to bring them in. But as you know, as a fellow co-founder, it's like once you think things are going well, you're always thrown off the deep end, going down another path and having to solve another problem. It's continuous problem solving, but that's the fun of it. We've made so much progress and we're going to continue. I think that's just so much of the fun of growing a company and doing what we do. Lukas: Greg, anything you want to add on unexpected hurdles along the way? Greg: Unexpected hurdles? I mean, that's every day. Lukas: Well, give me one. Give me one real story from the trenches. Greg: Oh, let's see. What's a good one that we've discovered recently? It's always getting back to the fact that biological data is messy and a lot of scientists are exceptional at what they do, but things come back that you're surprised at. For example, we assemble these plasmids, these long stretches of DNA in a circle that essentially convey various information about how to construct the drug and how to manufacture it at scale. A lot of the technology that we're developing is trying to say, ""Okay, if you put in this sequence, it will do this. If you put in this sequence, it'll do that."" In the process of building the precursors for that — I'm not going to credit deep learning here, just credit the infrastructure development underneath that — we discover, ""Oh, hey, in some of our assays, whole sections of the DNA have just been cut out and have been looped together into a smaller shape. What's going on with that?"" This was nobody's plan.
Your AI is not going to say, ""Wow, that was a really interesting phenomenon. You should go..."" These are the sorts of things where it's that collaborative environment where an AI scientist, even just in the process of getting things ready for ingestion into an AI, can really make sure that all the data is together and understood, and a lot of these things are overcome. Then, of course, on top of it, now you get the insights of, okay, now for the ones that are together, what do we see here? What is interesting? Sean: I think it all goes back to the hardest part that we deal with, which is the biology. We can predict these billion-member plasmid libraries to build, but it could take us a week to build it or it could take us two months depending on the complexity of it, and we just don't know, because it's biology. It keeps it interesting. Lukas: Well, awesome. Thanks so much for your time, guys. This was really fun. Really appreciate it. Sean: Thanks so much, Lukas. Greg: Thank you. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out.",7865 +"Chris, Shawn, and Lukas — The Weights & Biases Journey",https://www.youtube.com/watch?v=Dzu3WJMjdaM,2971,2021-11-05,"Chris: We had presented to OpenAI, to a group of researchers. I remember I was presenting and half the audience was just looking at their laptops. I left that meeting feeling like no one cares. Then a week later, Woj calls us up and he says, ""Hey, we got this problem on the robotics team. Come check this out. Can you help us out?"" We go, look, and I remember going there with Shawn. Both Shawn and I are looking at it and we're excited, right? Because we can solve the problem. They're telling us they have a problem and they want us to fix it. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. This episode is a fun departure from our normal format, where I interview two people in the machine learning space about their work. But these two people happen to be my co-founders of Weights & Biases. Shawn Lewis, our CTO, and Chris Van Pelt, our Corporate Vice-President. I talk to them about questions that have been stewing inside of me for years, like ""Why did we start this company?"" and ""Where is it going?"" Honestly, I was surprised and educated by their answers and I hope that you enjoy listening to this episode as much as I enjoyed asking them these hard questions. Lukas: All right. Here we go. Should we just jump right in? How did this company start? I ask this because it's the most common question I get asked if I go on any other podcast or with any candidate. Everyone wants to know how you started the company. I was kind of realizing you two both have founding stories that tell the same story, but they may have diverged in their evolution. I would be curious to hear each of your versions of the story. Which of you wants to go first? Chris: Shawn, I think you should go first. Shawn: All right. When I started my career as a software engineer, I was at Google and I was in the platforms team at Google. I joined...this was back in 2006. I worked on all kinds of stuff in the platforms team. The platforms team was responsible for building the data centers and all the machines inside the data centers that Google uses.
I wrote a lot of the software that ran in those environments. When you write tests like that, they generate a ton of data. At first, I was writing tests, but then I ended up with just all of this data. Really, where I ended up spending my time was on making tools and data pipelines that would help us understand that data. That's what I realized I loved doing. I ended up writing all kinds of different tools and data pipelines — and the data's in all these different databases all over Google — and I merged it into one place, and defined the metrics that we used to understand hard drives. And then made these user-facing tools — and when I say user, I mean other people at Google — that they could use to dig in and understand this data. I worked on a lot of stuff like that at Google. It was super fun. Then I eventually left. Then I started this other company called Beep. We built hardware. We built a lot of things. This is kind of tangential to the story so I won't go into it, but we were also in Y Combinator. In that Y Combinator batch was actually Lukas's wife, Noga, and I'm not sure if folks know this, but she's the founder of a company called PicnicHealth. We were in a Y Combinator batch, we ended up sharing an office together in the Mission in San Francisco for a couple of years, and I got to know Lukas that way. We had a hardware component at Beep. We had this hardware lab. Lukas loves robots and he was kind of always tinkering in his garage building robots. He would come up to my hardware lab and poke his head through the door and go, ""What are you guys doing in here? Also, my robot's broken,"" and we'd just start to get to know each other and work on the robot and help him fix it. So we became good friends that way. As Beep wound down — I don't know, I think that was 2016, but this is my version of the story — Chris and Lukas were starting to step out of Figure Eight around the same time. The circumstances were really good. Deep learning was starting to take off. We were all really excited about it. I had been thinking about these problems a lot. I loved the chance to work with Lukas and Chris, and we just started hanging out, and talking about it, and jumped on that opportunity, and started building the stuff that I needed to organize that data. It was an exciting time and it still is. Lukas: All right. Chris, I think you have a different version of the story. Let's hear it. Chris: All right. We got to go back. We got to go back to like 2006, like Shawn. Except in my 2006, I'm coming to San Francisco for the first time and it's to work at an exciting startup. I feel like I've made it. I've been doing web development and trying to advance my career and happened to be into Ruby on Rails, which was really hot and exciting at the time. There was this startup in San Francisco that was using Ruby on Rails and using machine learning to create a smarter, hopefully more relevant search engine. That startup was Powerset and I came up to San Francisco — it was like the beginning of 2007 — and that's actually where I met Lukas. Lukas and I joined Powerset at roughly the same time. I was in the product team interfacing with a whole bunch of the other backend teams to try to create an interface to this exciting new tool.
Fast forward about a year or so, and Luke and I decided, ""Powerset has been fun, but I think it's time for us to have our own go at creating a Powerset, or a startup."" We set out to make crowdsourcing more accessible to the enterprise and to people who wanted to use it to collect training data to train machine learning models. Way before machine learning was as cool as it is today. Yeah, fast forward about 10 years through creating a startup with Lukas, learning a lot along the way. At the time that we started Weights & Biases, Luke's and my day-to-day responsibilities at the company were winding down. We were asking ourselves, ""What is next?"" I remember going to Lukas's workshop, where most of his podcasts are recorded, and playing with robots. I was really into Lua for a little while there, thinking we could make a cool Lua toolkit for robots. But then it was really Lukas having an internship at OpenAI and actually building models with some of the world's most renowned machine learning researchers and needing tools to help him get his job done that was the initial itch that we were scratching. I love building tools. Luke's like, ""Hey, I need a tool to help me build models at OpenAI."" I said, ""Great. Let me try to whip something up,"" and just really poured myself into making a very early prototype. Then shortly thereafter, Shawn came into the picture, and I am forever grateful. Lukas: Awesome, man. I was getting some messages from Lavanya to pull you both back on the rails while you were talking, but I love the extended cut founding stories. I think mine is like a sentence or two. Sorry, Lavanya. Now she's texting me what the... But, no secrets here. Lukas: Actually, this is a good segue into another question for you. Shawn, you've been out on paternity leave. Chris, you've been talking to lots of customers independently from me. People keep asking me this; they're like, ""What is the architecture of the Weights & Biases server?"" I try to describe it and I realize I honestly have no idea. I know there's MySQL involved and React. Can you give me the several-sentence lay of the land? Say I'm an engineering candidate and I just want to know what we're using and how it all fits together. Chris: Okay. All right. We've got a single-page React application that is our front end. It's just a lot of JavaScript. We load that up into the browser and then it makes requests against our GraphQL backend, which happens to be written in Golang. When a customer wants to run Weights & Biases themselves, we actually deliver all of this — the single-page React app and the GraphQL backend API — in a single Docker container that they can run within their Kubernetes cluster or in a managed Terraform-based deployment that we support. Then the backend persistent stores are super simple. We've got MySQL and an S3-compatible object store, or Azure Blob Storage, or Google Cloud Storage. There's a little Redis in there, but customers generally don't have to worry about that. Shawn: It's nice because from early on, we knew that having the potential to go on-prem was really important for our customers because of data privacy concerns, because these datasets are so valuable and sensitive. We really just kept it simple to make that possible and, yeah, those were good early choices. Lukas: Shawn, you wrote a document at one point that was, ""What it would take to be a billion-dollar business?"" for Weights & Biases.
I thought maybe we could pull it up and compare it to what it actually took to be a billion-dollar market cap business. Looking at this document, how do you feel that things have played out? Has anything played out differently than you expected? Shawn: The core of the argument was, ""We're not sure if we can build better products than everybody else in the space, but we can raise a lot of money."" We knew we could do that, and we were well connected to lots of customers because of Lukas and Chris's background with Figure Eight. So it was very easy for us in the early days to go into a customer with very little to show, just a demo of the early parts of the product, and have good conversations. That's really what it takes to develop good products: actually interacting with customers who look at the product and give you feedback, either on a demo or by actually using the product. The argument was we should rapidly expand into the different parts of the ML pipeline in parallel and leverage those connections and the ability to raise money. We could build a team that could build products in each of these spaces. I would say we didn't quite do that. I think we still have this goal of expanding across the ML pipeline. This early theory that maybe we're not better, we may not be great at building products...I would say this is maybe not so humble of a statement, but I think we built something that users really love and we definitely did it by hiring great engineers and great product people and by talking to these customers a ton and spending lots and lots of time and just having this relentless customer focus. But I also think that somehow, at the core, there's this magic somewhere in what we're doing at W&B and that we do understand the space and the customers and we turn that money and customer connection into great products. So back then, I was thinking, ""Well, we have great products,"" and looking back now, I feel like we really do. That's really been a cool journey. Lukas: All right. I have a question for both of you I was wondering about. Was there a moment where you felt like the business was really working or the company was really working? Can you think of a time when you really suddenly felt like that or not? If so, what time was it? Chris: I think there were a couple of times, but one that really stands out for me is driving down to Palo Alto or Mountain View or wherever we were meeting; one of our first deals was with Toyota Research Institute. I remember sitting outside; I had grabbed lunch with Ari, our first Account Executive. We knew we were going to go into this meeting and present the number that we were going to sell our software for. I had this thought of, ""This moment is really important."" It's scary. When you're brand new, you don't have many other customers, and you go and you say, ""Hey, we want to charge this much for our software."" I felt like this was make-or-break. We went in. The meeting went really well and we ended up closing. TRI is one of our first customers, which is great. But after that, I mean, I wasn't like, ""Okay, now we're set. Next week, we can get to a billion-dollar valuation,"" but that first customer was really big. Shawn: Yeah, that was a big one. As a founder, or probably anybody at an early stage startup, you say, ""Oh, if we can just do this one thing, then we'll be sure that we made it."" Then the next week, you're back to work, and you have to grow some more, and you're always looking at the next thing.
But that first customer was a great one. I think for me, what comes to mind is — I think this was maybe around the holidays two years ago — in the earlier stages, even up to, say, 15, 20 people in the company, you kind of have a pretty good sense of everything that's happening. Every deal, I remember being a part of in some form. But there was a moment around the holidays a couple of years ago. We have these Friday meetings where we get together and talk about how the week went. It's everybody at the company and somebody says something great that they did. In that particular meeting, I just remember there was somebody on the sales team who had made another sale to a customer I had never talked to. Somebody in the growth team who had found a new growth experiment to do and executed it and actually made numbers change. Somebody in the product team who had done something that I didn't even know they were doing. All of those things came together in one meeting for me. That's when I felt like, ""Wow, this company is a lot more than...I can't wrap my arms around it and push everything forward anymore. There are all these great people around me who are doing that."" That is an amazing feeling because from there, you add more and more great people and the company continues to go in a good direction. It's bigger than yourself. Lukas: Another question I had for both of you is, is there a favorite feature in the product that you feel really proud of or some way that the product works that you feel like is uniquely great? Chris: For me, it's got to be the command line interface. I think it's a very under-appreciated interface to our product. In the early days, I spent an obsessive amount of time on making a whole bunch of command line commands and making it work nicely in Unix. I was piping stuff at one point. I think we've since decided that's not the way most people want to interact with our product, but as a Unix nerd, that's my favorite part for sure. Shawn: Maybe something that people don't really see is, there is this layer in the frontend that we made. In W&B, you've got all these different charts on the screen. The architecture of the frontend is that each chart can individually make its own network request to get the data that it needs to show. Those are actually heavy requests because there are sometimes millions of data points if you've logged a lot of data. So we built this cool layer in the frontend; it's like a middleware that watches all of the requests going out. It has a little time delay so it can aggregate them all together. It says, ""Give me all the requests that happened in the last hundred milliseconds."" Then it does lots of cool handwritten optimizations to figure out how to merge certain kinds of queries together, get all the results at once in a single request, and give them back to the user. A lot of users have no idea that's going on, but when you're building UIs, you really want them to be snappy and fast. I hope that I'm not shooting myself in the foot. I'm sure somebody will have a story of Weights & Biases not being snappy and fast. If you do, send it my way and we'll get it fixed. But there is this massive amount of engineering effort that went into that chunk of code to make sure that charts with millions of data points all on the screen at once can be updated really quickly. Lukas: Just so people don't think I'm only asking softball questions here, this is something that a couple of candidates have asked me about recently.
Have there been product or engineering efforts that we would do a lot differently in hindsight? Chris: I mean, I've got one that isn't bad, so it's not fair. When we first started the company, I wrote the backend in Python. It was Python 2 because I wanted to use Google App Engine to start things up so I didn't have to do a bunch of DevOps stuff. That quickly stopped scaling, especially because we have a GraphQL backend where things need to happen in parallel. Our first engineer, Tom, actually rewrote that entire backend in Golang. When you do a big rewrite like that...we already had TRI as a customer, OpenAI was a heavy user. We had to keep the site up while this was happening. Usually those kinds of exercises can go sideways: they can take way longer than you anticipated, or the ultimate project doesn't work out. That was an example where Tom actually got the project done ahead of time and we're still running pretty much the same backend code to scale up to the tens of thousands of users that we have every day now. It's not a decision I would go back and undo. Now that I've had a good experience doing a rewrite, I certainly wouldn't say you should always do one, but you don't hear this often: our choice to do the rewrite was definitely the right choice and it worked out really well. Lukas: What customer feedback has surprised you the most? Shawn: I have an example from early on when we started the company, which was...the very first version of Weights & Biases was more of a command line tool around saving data. There was a little bit of UI. Chris had built this Python library and this UI. The UI essentially let you log in and set up a place to store the data that the command line tool was saving. But it didn't do a whole lot more than that. The tool also collected the outputs of your training runs, like the logs, and I remember Chris added a feature that would look for the specific Keras metrics printed over time and just make a line chart of that. Of course, early on in a startup, what you do is you have a demo or a thing that works, you go to customers, and you show it to them. All the customers were like, ""Yeah, command line tool. Okay, okay. But what's that chart?"" They would really focus on that chart. The reason it was surprising to me is because these are programmers and data scientists and people who are really comfortable in Matplotlib and Jupyter notebooks. A thesis you might have is, well, data scientists don't need a tool to create a bunch of charts in their browser because their use cases are going to be so different and they're just comfortable doing it themselves. It was really a big surprise to me that that was the main thing people focused on. We saw that and we said, ""Well, let's follow what users want,"" and we did that. We kept building the UI and making it better and better, and now we have a generic UI that solves lots of different kinds of use cases. But it was surprising that that would be possible for this kind of user. Lukas: That's a good founding story right there. Chris, what about you? Chris: The most notable user feedback that is top of mind is from an early user, Hamel at GitHub, who was a heavy user of the tool. I remember one night, Hamel wrote in and said, ""Hey, we really want to log HTML,"" and we were actually able to ship that feature that same night, which delighted Hamel.
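For context, this is roughly what those two features look like against the present-day wandb Python library (the modern client API, not the original prototype shipped back then); the project name and logged values below are placeholders.

```python
# Log per-step metrics (the line chart Shawn mentions) and a snippet of HTML with wandb.
# Uses the current wandb client API; offline mode avoids needing an account to run this.
import wandb

run = wandb.init(project='demo-project', mode='offline')  # placeholder project name

for epoch in range(10):
    loss = 1.0 / (epoch + 1)  # stand-in for a real training metric
    wandb.log({'epoch': epoch, 'loss': loss})

wandb.log({'notes': wandb.Html('<h2>Experiment notes</h2><p>Logged as HTML.</p>')})
run.finish()
```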
But Hamel was also...he did not hold back in telling us where we were not being excellent in the UI and was very honest about some pretty serious issues with the system at the time. I remember it breaking my heart as a founder. Like here, I've got someone who is engaged and excited about us, but he's getting frustrated by using our tool. It's the worst possible thing I could imagine. The team really focused and did a lot of hard work to redesign and re-engineer a lot of the problem interfaces that Hamel was running into and ultimately I think it really helped us make a better product. Lukas: How do you think the last four and a half years of running this kind of hyper-growth startup have changed you as a person or changed your perspective on the world? Chris: I remember when we first started the company, after Luke and I had been working at CrowdFlower for 10-plus years. I was just so excited to have a blank canvas. It was like, ""We can start fresh."" There is nothing legacy we need to support. It's just green fields. The company stuff, the process and the management and all of the things that you need to do to make a company work, historically didn't interest me that much. I think something has changed, especially with this company. Now I find those things more interesting. Being able to step away from just hacking all the time and actually think about, ""Okay, how do we build a culture and how do we mentor and work with the team to ultimately build a better product?"" I think those problems are much more interesting for me this time around than they were when we were running CrowdFlower. Lukas: Another question I had is, what do you think has changed around us as we've been running this company? Do you feel like customers are different now? Do you feel like the industry is different at all? Shawn: We started the company within a year or two of when deep learning first took off, when AlexNet was first trained, or when AlexNet actually showed real results. We really focused on deep learning. I mean, one of our first customers was OpenAI. That's well-known. They're still a great customer of ours and we spent a lot of time building things that were tailored to OpenAI use cases. When you start a company, it's good to make a bet on what you think a growing market will be because you don't want to go into...you can do this, but you don't necessarily want to go into a big established market and just fight with Google and Amazon. It's better to focus maybe on something that's smaller, that they won't spend all their resources fighting you on, and then the market grows along with you to the point where, all of a sudden, you're this billion-dollar company. You can't just do that. There's some amount of luck, for sure. We had very good timing in starting Weights & Biases. That's a really cool feeling and it's really cool to ride that trend. In doing that, what we've seen is deep learning really took off. I mean, it's applied in every vertical now. Every company has at least a few people who are building deep learning models now. Those teams are constantly growing and we see that in the way that our contracts grow with our customers. We sort of bet that that would happen, but to see it actually happen and to be able to ride that trend, there is no way to really feel what exponential growth is until you're in the middle of it, and that's what it feels like. Chris: I remember one moment early in the company that stands out. We had presented to OpenAI, to a group of researchers.
I remember I was presenting and half the audience was just looking at their laptops. I left that meeting feeling like no one cares. Then a week later, Woj calls us up and he says, ""Hey, we got this problem on the robotics team. Come check this out. Can you help us out?"" We go look, and I remember going there with Shawn. Both Shawn and I are looking at it and we're excited, right? Because we can solve the problem. They're telling us they have a problem and they want us to fix it. After that, Shawn and I, we pulled an all-nighter just cranking out the interface that they wanted and got it to them within a couple of days. I remember thinking, ""How precious is this relationship with OpenAI, this institution that I really, really admire?"" Also, that same feeling that Shawn described of some users saying, ""Hey, I have this problem,"" and we had the power to go back and actually fix that problem for them. Lukas: Do you remember the afternoon when they turned on Weights & Biases? Shawn: That was another all-nighter. Chris: Yeah, it turns out there were some performance problems with my Python backend, if I'm recalling correctly. Shawn: Well, there were a couple things. We did not anticipate OpenAI's scale, because we're doing the thing that you do as a startup, which is you make an MVP. It doesn't really need to scale. But it turns out our very first customer was one of the largest-scale customers we could have. They were the first customer who integrated Weights & Biases into this library that they had, which everybody on their robotics team was using to run training code. As soon as they committed that to production, they started sending us a ton of traffic. The site just immediately went down because it's this cobbled-together startup website. Actually, the first problem was there was some API limit that we hit on Google because of the way we were making a specific request. Chris might remember what it was. There was no resolution. You can't just call Google and get them to immediately change a limit for you. You actually have to wait for a support case to go through for a number of hours. Of course, now, maybe we're a big enough customer of Google that we could have some influence, but back then, we were just this little startup and we couldn't go, ""But it's our first customer."" That doesn't really sway anyone over there. I don't remember how we worked around that. Do you remember? Did we just wait? Chris: We had very smartly designed the Python library to back off when things started failing, so I think the quota issue got resolved within the retry timeout. Shawn: That was one problem. Then later that evening, of course, we're pumped because we have all this data coming in, and it's OpenAI, and it's our first customer. Some other problem cropped up. Again, you might remember this, but it was something where Chris and I, we were up until 5:00 AM that night and Chris was live patching our App Engine code. It was Google App Engine at the time, which is this Python auto-scaling platform that's not used as much now. I remember, yeah, we came up with a plan and we live patched the thing. I was like, ""Is this going to work?"" It did and the traffic started coming in clean and we could see all the data. We were so proud to have our first customer. Then, of course, the next morning, we went to talk to OpenAI and they didn't even notice the hiccup. They were like, ""Oh, yeah, cool.
Thanks."" But, I mean, really, it was working when we went to have that conversation and they started looking at the sort of charts that we had and started giving us that feedback. That's when we got into the feedback cycle. It's important. In the early days, if you have any customer at all and they have a problem, stay up all night and solve their problem. Even if they don't notice it, it's worth it for you to start getting that great customer feedback loop. Lukas: What were your darkest moments, specifically? Chris: Early on when it was me and Ari, that was the sales team. This was before the pandemic, so we would fly wherever we needed to fly to. Some might think, ""Oh, you get to fly to Toronto? That's got to be great."" It's not great. You fly and then you go to some hotel and then you go to a meeting where you're just trying to get people to engage with you and learn about the product. There were a couple months there very early on where I felt like, ""We can't charge enough for the product, people don't see it as being valuable enough."" It can be very demoralizing, especially when you're out there on the front lines of sales and trying to educate and teach people about these concepts that are literally being created as we're iterating on the product. Shawn: I think a dark moment for me is — we already talked about this — but when we were first trying to sell to GitHub, we had this user, Hamel, who we talked about earlier. He gave us very direct feedback about how our product sucked. He was totally right. I love building this product and even now, it's hard. I'll take it personally for sure. Even though I know that some of the decisions are bad or there's lots of things that could be improved. When somebody calls it out, it definitely hurts. But we really want that feedback. I'm happy to go through that rollercoaster of emotions to make a better product. Really, that feedback led Weights & Biases to the place it is today. You have to be willing to accept there is lots of bad things. We want to know what they are. We made those decisions. It's my fault. It's Chris' fault. It's probably Lukas' fault and Lavanya's fault to some extent. We can always improve stuff. You take those gut punches in stride and keep making it better. Lukas: I have this memory — I wonder if this is an accurate memory or you see it the same way — but I remember thinking of making experiment tracking and doing an offsite where we really built something super custom for TRI, and then going there and showing them a beta of the experiment tracking stuff that we built and having them basically tell us this isn't that interesting and that feeling bad. Then I have this memory of talking to you, Shawn, and I think you were like, ""I don't think anyone will pay for experiment tracking."" I was kind of thinking, ""Yeah, you're probably right."" Then I remember talking to you, I was like, ""I just need to tell you that I need you to be more positive,"" which is so funny because I feel like actually you're almost always the optimist. I remember at least thinking to myself, ""What I need to communicate is I just need you to be positive even if it's not rational to be positive here because I'm feeling a lot of doubt myself."" Is that an accurate memory? Shawn: Yeah, yeah. I remember. I remember that. That was before we had that user, that first user who was actually using the thing, and we felt like they were getting value. We kept saying, ""What if we build this other thing? Will that get us a user? 
What if we build this other thing?"" We did that for a number of months. It was the early stages of the startup. That was disheartening because it's like, okay, if we could build this other feature tomorrow, but nobody's going to care. That was kind of the mindset I was getting into and you were rightfully calling me out on, well, this is the early stage of a startup, so let's make the next thing until we have that user. It was getting that first user who broke that off. From there, it was, ""Hey, I just need this one little tweak. Great, we'll do it,"" and it was all positive. Maybe not all positive, but more positive from there. Chris: All right, Luke. If there are other entrepreneurs listening to the podcast and wanting to build a startup that achieves a billion-dollar valuation, what advice as a startup CEO would you give them? Lukas: I feel like the advice probably depends on who that person is. Let's picture someone. Who are you thinking of here? Chris: All right. It's someone that looks a little like us, right? They're programmers. They're interested in starting a company, but maybe don't have a ton of experience on the business side of things, but they're passionate about the product they're creating. Lukas: There's so much advice out there that I think is really good these days. I feel like when we were all starting our companies the first time, being an entrepreneur wasn't a thing. Y Combinator's put out so much good stuff that's really...you forget how not obvious it is to people that you need to make something that people want, right? You can't emphasize that enough, right? I feel like now people kind of know that, which is fantastic. It definitely wasn't obvious to everyone when we were starting or maybe as obvious to us as it should have been. I think the thing that people don't talk about as much as I think they should or the advice that I feel like I can uniquely offer, because it's worked so well for me, is to pick a customer that you really love spending time with. I feel like a lot of these ML startups especially, they totally start from a technology and what's interesting to do with it. That's a bad idea. Everyone kind of knows that's a bad idea. Then they work backwards from a use case that they find interesting and that's maybe...it's an okay idea. The thing that gets lost is that, at least for me, the thing I do as CEO, the thing I have to do all day long is spending time with customers, spending time empathizing with customers, thinking about customers and bringing the customer voice into the company. Given that that's, I think, maybe the most important job as CEO, you should pick a customer that you really like, right? Because you're going to spend so much time with them over the entire arc of your company. Having a specific idea of who that is and making sure you like them, I think, is a really key thing. I remember at CrowdFlower, we tried to sell into different types of customers and so I really felt this. I went to CMO conferences. I contrast that for myself going to NeurIPS and just really enjoying making small talk, enjoying all the details, the things that people say. I also believe it's very powerful for the world and good for the world, but I think even more than that, just on a day-to-day motivation...the impact will sustain you over the long-term, but over the short-term, I think I really appreciate that I'm working with a user base that I really care about and enjoy talking to. Shawn: Lavanya's in the chat here. We're taking questions from the audience. 
Lukas: What's the hope for Weights & Biases in the next five years? I feel like almost it's like a jinx to say that question. I don't even know if I have a good answer. Maybe if you guys want to try first? Shawn: Of course, I take it from a product and tools standpoint. I hope that we can build interconnected tools across the ML pipeline that really work well together because they share these common underlying threads or infrastructural pieces or, dare I say, bones? Which is what I like to call them internally and everybody makes fun of me for. To me, it's really important that the data that you collect about your model in production can be used to inform decisions that you make back in the data collection process and the training process. There's so many parts of the ML pipeline, it's hard to build all this stuff. But if you think of the best companies, somebody like Google who's building ML, they've built all this. They've verticalized all of it internally and built it themselves. They've built tools out of other tools. I want to be able to make a platform like that outside of a giant company and give it to the rest of the world and use all the use cases that we encountered to make it better and better, and more general. Lukas: I think it'd be really satisfying if Weights & Biases becomes a core part of every ML team's infrastructure and we're really known for making really high quality stuff, really useful stuff, really powerful stuff. I think we're on that trajectory, but I think ML is growing so much that that becomes every company when every company has an ML team, which it seems like we're headed to in the next five years. So I think that's the biggest thing. Shawn: If you imagine the company that is in that position, what does it look like internally, I think we have this today. But, I mean, there's folks building data tools and ML tools at lots of companies in the world and doing a great job. I want to get all those people in the same place. People who love building these tools. Maybe there's folks out there who work on these tools and don't really love it and want to switch to something else. That's great too. I mean, you should move through things in your career. But I would love to be surrounded by people who really love this problem and really love the people who work on this problem, these customers, just all together, all of us. Maybe not in an office in the modern world anymore, but distributed around working on these problems together and just building great stuff that users love so they can build ML models that make the world better. Lukas: Here on Gradient Dissent, since you guys are both mega fans, you know this, but for new listeners, we always end with two questions. The penultimate question is what is the most, or an underrated topic in machine learning? Something that you would love to work on if you weren't working on Weights & Biases. Chris: I want to make a painting robot. But it uses an actual brush, okay? Not just a plotter or something. It's going to be very complicated. That's not the big world-changing answer maybe you were looking for, but that's what I would be really stoked to pour a year of my life into I think. Lukas: Like a paintbot? Chris: Yeah, a paintbot. That's right. Lukas: Cool, cool. Shawn: Underrated. Underrated topic. Maybe we should have started that robot company... This question's funny because we touch so many of the problems in machine learning at Weights & Biases. 
Not all of them, but we're building tools that are used for all the verticals, so of course we're going to touch them. So maybe I'll say something that we're not explicitly doing, that I think is really important, and I feel like I am actually doing, which is...I think model understanding is critical in the future. Deep learning models are really tricky to understand. It's a research area. It's really dependent on what kind of model you're building, what techniques you might use to figure out ""Why did my car make the decision that it did at a specific point in time?"" I think as these models get more and more complex, it's more and more important. I want to understand the models. I want to understand the models for the world to be good and I want to understand the models because I think that gives us some understanding into the nature of intelligence and our own decision-making processes. We're not explicitly doing model understanding at Weights & Biases, but we're trying to build tools, and we'll talk more about this over the next six months, I guess, that head in that direction. Lukas: I'm tempted to answer this question, but I feel like maybe it's better as a host if I remain mysterious in terms of what I think is the most underrated topic. I will say I love the company that we started and I would not change it. Obviously it was a really good choice to do it. But one thing that we kicked around in the early days... Well, we kicked around two ideas. One I think is a terrible idea, which is a painting drone, which I think would be fun, but probably just terrible idea. But still, it would be really fun. The second idea which I think I would still love to do it, and I think we thought we were the wrong team to do it, but actually I think we might have been a good team to do it, but the timing seems like it might have been bad, which is to build a better simulator to help robot companies simulate what they're doing and then deploy into the real world. I just think that that still kind of needs to exist. There's different takes on it, but it doesn't seem like it's been nailed at all, especially with the physical part of it. I also think that would be a really fun company to do, although a much slower ride. But my answer to the most underrated topic in ML is still a secret. Lukas: The final question is when you look at actually all of our customers trying to deploy models successfully, what do you think their biggest pain point is? We always ask guests this who are mostly at these companies trying to do it, but looking out at everyone — maybe we restrict this to people already using Weights & Biases because to people not using somebody, obviously maybe that's their biggest problem — but the people that are already using us, what is the big thing that they run into? Shawn: I mean, there is a cop-out answer, which is it's probably hiring. Lukas: That might be accurate. Hiring? Yeah. At some point, we should collect some stats on what people say and maybe that will just answer it based on...at least, people we interview. But yeah, what are you guys seeing? Shawn: It really varies in our customers because somebody who is building a self-driving car, for example, is they're building 100 models in parallel and with no proof that self-driving cars could still actually exist and work on public streets. I guess, we're getting very close now. As opposed to somebody maybe who's got a bunch of financial data and needs to predict credit scores. 
In the credit score problem, you actually...model understanding, what I was just talking about, really, really important. You probably need to use something dumber than a deep learning model so that you can actually say why you made a particular credit prediction. Our customers are extremely varied. I hope we're solving a lot of the problems in the model creation side of things. I think that there is a really hard problem of figuring out what models are doing in production and then taking the data from production and integrating it back into the model training process. I hope that we get to work on that problem too, but I bet that a lot of our customers would express that they have challenges there. Chris: Yeah, I'll piggyback on Shawn's response and say specifically CI and CD, when it comes to these ML pipelines, just is nowhere close to what we have in the regular software development world. I know we have a lot of exciting things on our roadmap to help with automating all of these steps as a model moves through the pipeline and then comes back to get retrained and understanding how it's performing in production, but I personally look forward to the day that all of this can be automated in a way that doesn't involve people manually running shell scripts, which is often the case today and really unfortunate. Lukas: Awesome. Well, thanks so much, guys. It's been a real pleasure working with you and can't wait for many more years. Shawn: You guys too. Chris: Can't wait for more podcasts, Luke. You're a fantastic podcast host. Lukas: Yeah, we got to bring you back a year from now, see where we're at. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we worked really hard to produce. So check it out. Lukas: I thought it'd be fun, since it's a friendly guest obviously, to add some zany new features to this podcast. I thought one section would be...I'm going to name a technology and then you got to immediately say underrated or overrated, and then if you disagree, we can fight it out. What do you think? Shawn: Let's go. Lukas: Are you ready? Okay. Reinforcement learning. Shawn: Is there a middle ground? Wait, is it only under or over? Lukas: Yeah, you got to decide. Shawn: I love reinforcement learning. Underrated. Chris: Yeah, I'll go with Shawn, underrated. Lukas: All right, all right. AutoML. Chris: Overrated. Shawn: Yeah. Chris: Come on, Shawn. It's bad for business. Shawn: Underrated. Lukas: I already forgot what you said, Chris. You think it's overrated? Chris: Yeah, yeah. Lukas: Wait, why? Chris: Because I think it's really important that the ML practitioner is using their own creative powers to make choices about how the model is architected. It's like automation everywhere else. I have mixed feelings about it. Lukas: Shawn, you want a quick rebuttal to that? Shawn: Yeah, I mean, I'm torn on this one. There are two good answers, but I do think...of course, the technology to train models should evolve over time and things should get smarter. For example, our Sweeps tool does help you find good parameters for a model. I think as a company, what we need to do is as those tools get better and better at automatically building models, there's all kinds of other problems around model building, like getting the right data in the first place, and we need to build those. 
This space is moving so quickly, and of course AutoML will probably continue to improve, but there's lots of other problems around it to be solved for the practitioners. Lukas: All right. Well, I think people should leave comments for which they found more convincing, but definitely bonus points for the Weights & Biases plug in there. I like it. Okay, next one. The singularity. Chris: Underrated. Shawn: Yup. I'm with Chris. Underrated. Lukas: Whoa, whoa. Bold. All right. We'll move on. Okay, ready? Bigtable. Chris: Overrated. Shawn: I'll go with underrated. Sure. Lukas: Whoa, all right. Shawn, maybe you go first. Shawn: I think Bigtable...I was at Google when Bigtable was starting to be used. At the time, it's like 2006, Bigtable was really big inside of Google. It enabled all of the technology that we have. The search, and everything else that people were building, Gmail. It didn't exist elsewhere in the world. When I left Google, I saw the world starting to copy Bigtable and became things like NoSQL. It had a huge impact on the world. I think maybe today, raw Bigtable itself, maybe this is what Chris is probably alluding to, is a bit of a challenge. Maybe I'll let Chris take it from there. Chris: The reason I said it, because I remember a day early in the founding of this company, when we were trying to figure out where to store our metrics, and you were like, ""Bigtable will solve all of our problems. It's perfect for this."" It's kind of been a thorn in our side a little bit. I mean, it's done its job well, but you could ask a handful of engineers at Weights & Biases and I bet you most of them would gripe about Bigtable. I think our use case is a little funky for the way we're using it. Shawn: Yeah. For large time series where you want to fetch a few million contiguous points at a time, especially in a shared Bigtable cluster that you get from Google, there's some challenges there. Lukas: Okay. Python. Shawn: Underrated. Chris: Yeah, underrated. Lukas: Okay. Jupyter. Shawn: Underrated. Chris: Underrated. Lukas: JupyterHub. Chris: Underrated. Shawn: Underrated. Great. Lukas: Kubeflow. Shawn: Underrated. Chris: Underrated. Lukas: All right. All right. SageMaker. Chris: Overrated. Shawn: Overrated. Lukas: Interesting. TensorFlow. Shawn: Underrated. Chris: Underrated. Lukas: Wow, you guys are super aligned. Hard to get some dissent here. Shawn: Yeah, we got to live up to your podcast name. Lukas: Yeah, yeah. I guess we'll get some other pairs to guess or maybe we could ask other ones the same questions and then see what they say. Shawn: I mean, do you differ from us in any of those, Luke? Lukas: I mean, some of these technologies, I'm even kind of hazy on what they are. Like I was thinking BigQuery, Bigtable. I was hoping I would learn what the difference is to be honest.",8575 +Pete Warden — Practical Applications of TinyML,https://www.youtube.com/watch?v=34rL1V9PpZQ,3208,2021-10-21,"Pete: The teams I've seen been really successful at deploying ML products, they've had people who, formally or informally, have taken on that hot responsibility for the whole thing, and have the people who are writing the inner loops of the assembly sitting next to the people who are creating the models. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. This is a conversation with Pete Warden, well-known hacker and blogger. 
Among many things that he's done in his life, he started a company Jetpac, which was a very early mobile machine learning app company that was bought by Google in 2014. He's also been a tech lead and staff engineer on the TensorFlow team since then. So he's been at TensorFlow since the very beginning. He's written a book about taking ML models and making them work on embedded devices, everything from an Arduino to a Raspberry Pi. And it's something that I'm really passionate about. So we really get into it and the technical details. I think you'll really enjoy this interview. Quick disclaimer for this conversation: We had a few glitches in the audio, which are entirely my fault. I've been traveling with my family to Big Sur, which is a lot of fun, but I didn't bring all my podcasting gear, as you can probably see. If anything's inaudible, please check the transcription, which is provided in the notes. Lukas: All right, Pete, I have a lot of questions for you, but since this is my show, I'm going to start with the question that I would want to ask if I was listening. Tell me again about the time that you hacked a Raspberry Pi to train neural nets with a GPU. Pete: Oh God. Yeah, that was really fun. So back when the Raspberry Pi first came out, it had a GPU in it, but it wasn't a GPU you could use to do anything useful with, unless you want to draw things. But who wants to just draw things with a GPU? But there was some reverse engineering that had been happening and some crazy sort of engineers out there on the hardware side who'd actually managed to get a manual describing how to use the...how to program the Raspberry Pi GPU at low level. And this had been driving me crazy ever since I'd been at Apple years ago, because I was always able to use GLSL and all of these comparatively high level languages to program GPUs. But I was always trying to get them to do things that the designers hadn't intended. Like when I was at Apple, I was trying to get them to do image processing rather than just doing straightforward graphics. And I never — You may hear a dog in the background. That is our new puppy, Nutmeg — but I always wanted to be able to program them. I knew that there was an assembler level that I could program them at, if I only had access. I spent five years at Apple, tried to persuade ATI and NVIDIA to give me access. And I actually managed to persuade them, but then the driver people at Apple were like, ""No, don't give him access because then we'll have to support the crazy things he's doing."" So when the Raspberry Pi came along- Lukas: Was this Raspberry Pi 1 or 2 or 3? Pete: This was back in the Raspberry Pi 1 days. So it was not long after it had first come out and they actually gave you the data sheet for the GPU, which described the instruction format for programming all of these weird little hardware blocks that were inside the GPU. There really wasn't anything like an assembler. There wasn't...basically anything that you would expect to be able to use. All you had was the raw, like, ""Hey, these are the machine code instructions."" And especially back in those days, in Raspberry Pi 1 days, there weren't even any SIMD instructions, really, on the Raspberry Pi because it was using an ARMv6. Lukas: What is a SIMD instruction? Pete: Oh, sorry. Single Instruction, Multiple Data. So if you're familiar with x86, it's things like SSE or AVX. It's basically a way of saying, ""Hey, I've got an array of 32 numbers. 
Multiply them all"", and specifying that in one instruction versus having a loop that goes through 32 instructions and does them one at a time. It's a really nice way of speeding up anything that's doing a lot of number crunching, whether it's graphic or whether it's, in our case, machine learning. I really wanted to get some cool image recognition stuff. Since back when AlexNet was all the rage, I wanted to get AlexNet running in less than a 30-second frame on this Raspberry Pi. The ARMv6 really was...it was like, I think it was just like Broadcom had some dumpster full of these chips they couldn't sell because they were so old. This is not official. I have no idea if this is true, but it feels true. And so they were like, ""Oh sure, use them for this, whatever, this Raspberry Pi thing that we're thinking about."" They were so old that it was actually really hard to even find compiler support. They didn't have, especially, these kinds of modern optimizations that you would expect to have. But I knew that this GPU could potentially do what I wanted. So I spent some time on the data sheet. There were a bunch of...a handful of people did some open source hacking on this stuff so I was able to kind of fork some of their projects. Actually funnily enough, some of the Raspberry Pi founders were actually very interested in this too. I ended up kind of hacking away and managed to figure out how to do this sort of matrix multiplication. And that, funnily enough, one of the people who was really into this was actually Eben Upton, the founder of Raspberry Pi. So he was actually one of the few people who actually replied on the forums when I was sending out distress signals when I was getting stuck on stuff. So anyway, yeah I ended up being able to use the GPU to do this matrix multiplication so I could actually run AlexNet, recognize a cat or a dog in 2 seconds rather than 30 seconds. It was some of the most fun I've had in years because it really was just like trying to string things together with sticky tape and chicken wire. Yeah, I had a blast. Lukas: How does it even work? You're writing assembly and running it on a GPU. What environment are you writing this in? Pete: So I was pretty much using a text editor. There were a couple of different people had done some work on assembly projects. None of them really worked, or they didn't work for what I needed. So I ended up sort of hacking them up together. So I then feed in the text into the assembler, which would produce the raw kind of command streams. Then I had to figure out the right memory addresses to write to from the Raspberry Pi CPU to upload this program. And then that program would be sitting there in the, I think there was something like, some ridiculously small number of instructions I could run, like 64 instructions in there or something or 128. The program would be sitting there on all of these, I think there was four or eight cores. I would then have to kick them off. I'd have to feed in the memory from the...and it was, I mean, honestly it was like, in terms of software engineering, it was a disaster. But it worked. Lukas: Well. What kind of debugging messages do you get? I mean, I'm thinking back to college and writing this. I remember the computer would just crash I think when there was invalid... Pete: Well, I was actually writing out to a pixel, so I could tell by the pixel color how far through the program that it had actually got. Which...I'm color blind, so that didn't help. 
But yeah, it was really getting...it was getting down and dirty. It was the sort of thing where you can just lose yourself for a few weeks in some really obscure technical problems. Lukas: I mean, having worked on projects kind of like that, how did you maintain hope that the project would finish in a way that it would work? I think that might be the hardest thing for me to work on something like that. Pete: Well, at the time I was working on a startup and this seemed a much more tractable problem than all of the other things I was dealing with at the startup. So it, in a lot of ways it was just, it was procrastination on dealing with worse problems. Lukas: Great answer. Pete: Yeah. Lukas: What was the reason that the Raspberry Pi included this GPU that they wouldn't actually let you directly access? Was this for streaming video or something? Pete: Yeah, it really was designed for, I think, early-2000s set-top boxes and things. You were going to be able to draw a few triangles but you weren't going to be able to run any...it wasn't designed to run any shaders or anything on it. So, GLSL and things like that weren't even considered for it at that time. I think there's been some work on that since, I think, maybe with some more modern versions and GPUs. But back in the Raspberry Pi 1 days it's just like, you can draw some triangles and that's... Lukas: Have you been following the Raspberry Pi since? Do you have thoughts on the 4 and did they talk to you about what to include there maybe? Pete: No, no. I think they knew better because I'm not exactly an average user. I mean, as a sort of a general developer, it's fantastic, because the Raspberry Pi 4 is this beast of a machine with multi-threading and it's got those SIMD instructions I talked about. There's, I think, support for GLSL and all these modern OpenGL things in the GPU. But as kind of a hacker I'm like, ""Oh..."" Lukas: Well, it's funny because I think I met you when I was trying to get TensorFlow to run on the Raspberry Pi 3, which is literally just trying to compile it and link in the proper libraries. I remember completely getting stuck. I mean, I'm ashamed to tell you that, and reaching out on the forums and being like, ""Wow, the tech support from TensorFlow is unbelievably good, it's answering my questions."" Pete: Well, I think you ended up...you found my email address as well. I think you dropped me an email and again I think you caught me in the middle of procrastinating on something that I was supposed to be doing. And I was like, ""Oh wow, this is way more fun. Let me spend some time on this."" But no, I mean, you shouldn't underestimate that TensorFlow has so many dependencies. Which is pretty normal for a Python sort of cloud server sort of project, because they're essentially kind of free in that environment. You just do like a ""pip install"" or something and it will just work. But as soon as you're moving over to something that's not the vanilla sort of x86 Linux environment that it's expecting, you suddenly sort of have to pay the price of trying to figure out all of these...""Where did this come from?"" Lukas: Right. Right. So I guess one question that comes to mind for me, and I don't know if you feel like it's a fair question for you to answer, but I'd love your thoughts on it: it seems like everyone, except for people at Google, trains their models on NVIDIA GPUs. 
I'm told that's because of the CUDA library that essentially compiles, and cuDNN that makes a low level language for writing ML components and then compiling them onto the NVIDIA chip. But if Pete Warden can just directly write code to do matrix multiplication on a chip that's not even trying to publish its docs and let anyone do this, where's the disconnect? Why don't we see more chips being used for compiling? Why doesn't TensorFlow work better on top of more different kinds of architecture? I know that was one of the... I think that was one of the original design goals of TensorFlow, but we haven't seen maybe the explosion of different GPU architectures that I think we might've been expecting back in 2016, 2017. Pete: Yeah. I can't speak so directly to the TensorFlow experience, but I can say more generally what I've seen happening, speaking personally is, it's the damn researchers. They keep coming up with new techniques and better ways of training models. What generally tends to happen is it follows the same model that sort of Alex Krizhevsky originally did and his colleagues with AlexNet, where the thing that blew me away when I first started getting into deep learning was....Alex had made his code available and he had not only been working at the high-level model creation side, he'd also been really hacking on the CUDA kernels to run on the GPU to get stuff running fast enough. It was this really interesting...having to kind of understand all these high-level concepts, these cutting edge concepts of machine learning, while also being this in a loop kind of assembly, essentially...not quite down to that level, but like intrinsic, really thinking about every cycle. What has tended to happen is that as new techniques have come in, the researchers tend to just — for their own, to run their own experiments — they have to write things that run as fast as possible. So they've had to learn how to...the default for this is CUDA, so you end up with new techniques coming in as a CUDA implementation. Usually there's a C++ CPU implementation that may or may not be particularly optimized and then there's definitely a CUDA implementation. Then the techniques that latch on, the rest of the world has to then figure out how to take what's often great code for its purpose, but is written by researchers for research purposes and then figure out how to port it to different systems with different precisions. There's this whole hidden amount of work that people have to do to take all of these emerging techniques and get them running across all architectures. I think that's true across the whole ecosystem. It's one of the reasons that I really love for experimenting — if you're in the Raspberry Pi sort of form factor, but you can afford to be burning 10 watts of power — grab a Jetson or a Jetson Nano or something, because then you've got essentially the same GPU that you'd be running in a desk machine just on a much smaller form factor. Lukas: Totally. Yeah. It makes me a little sad that the Raspberry Pi doesn't have an NVIDIA chip on it. Pete: The heat sink alone would be... Lukas: One thing I noticed...your book is excellent, on embedded ML. Actually I was in a different interview — which we should pull that clip of an interview with Pete Skomoroch — and we both had your book at our desks, so we had both been reading it. I don't know if you know him but- Pete: Yeah, I'm a good...yeah, Pete's awesome. He's been doing some amazing stuff too. 
He's another person who occasionally catches me when I'm procrastinating and I'm able to offer some advice and vice versa. Lukas: We should have a neighborhood... Pete: Yeah. Procrastination, hacking procrastination list. Lukas: It seems pretty obvious that you do some interesting projects in your house or for personal stuff. I was wondering if you could talk about any of your own personal ML hack projects. Pete: Oh, that's a really...I'm obsessed with actually trying to get a magic wand working well. Lukas: Tell me more. Pete: One of the things I get to see is...would be these applications that are being produced by industry professionals for things like Android phones, smart phones in general. The gesture recognition using accelerometers just works really well on these phones, because people are able to get it working really well in the commercial realm. But I haven't seen that many examples of it actually working well as open source. Even the example that we ship with TensorFlow Lite Micro is not good enough. It's a proof of concept, but it doesn't work nearly as well as I want. So I have been...that's been one of my main projects I keep coming back to is, ""Okay, how can I actually do a Zorro sign or something holding — I've got the little Arduino on my desk here — and do that and have it recognize..."" I want to be able to do that to the TV screen and have it change channels or something. What I've really wanted to be able to do — we actually released some of this stuff as part of Google IO, so I'll share a link with you. Maybe you can put it in the description afterwards — but my end goal, because these things actually have Bluetooth, I want it to be able to emulate a keyboard or a mouse or gamepad controller and actually be able to customize it so that you can — or a MIDI keyboard even as well — and actually customize it so you can do some kind of gesture and then have it...you do a ""Z"" and it presses the Z key or something on your virtual keyboard, and that does something interesting with whatever you've got it connected up to. So, that isn't quite working yet. But if I...hopefully I get some tough enough problems in my main job that I'll procrastinate and spend some more time on that. Lukas: Man, I hope for that too. For people that maybe aren't experts in embedded computing systems, could you describe the difference between a Raspberry Pi and an Arduino? And then the different challenges in getting ML to run on a Raspberry Pi versus an Arduino? Pete: At a top level, the biggest difference is the amount of memory. This Arduino Nano 33 BLE Sense is...I think it has 256K of RAM and either 512K or something like that of flash, kind of read-only memory. It's this really, really small environment you actually have to run in, and it means you don't have a lot of things that you would expect to have through an operating system, like files or printf. You're really having to look at every single byte. The printf function itself...in a lot of implementations it will actually take about 25 kilobytes of code size just having printf because printf is essentially this big switch statement of, ""Oh, have you got a percent F? Oh, here's how you print a float value,"" and there's hundreds of these modifiers and things you never even think of for printing things you can ever imagine, and all that code has to get put in if you actually have printf in the system. All of these devices that we're aiming at, they often have only a couple of hundred kilobytes of space to write your programs in. 
You may be sensing a theme here, I love to fit...take modern stuff and fit it back into something like a Commodore 64. Lukas: It seems like Pete Warden doesn't always need a practical reason to do something, but what might be the practical reason between an Arduino versus a Raspberry Pi? Pete: Luckily I've actually managed to justify my hobby and turn it into my full-time project, because one great example of where we use this is...let's see my phone here, let's get a hold of my phone, you know what a phone looks like. If you think about things like — I won't say the full word, because it will set off people's phones — but the OK-G wake word or the wake words on Apple or Amazon. When you're using a voice interface, you want your phone to wake up when it hears you say that word, but what it turns out is you can't afford to even run the main ARM application processor 24/7 to listen out for that word because your battery would just be drained. These main CPUs use maybe somewhere around a watt of power when they're up and running, when you're browsing the web or interacting with it. What they all do instead is actually have what's often called an ""always on"" hub or chip or sensor hub or something like that, where the main CPU is powered down so it's not using any energy, but this much more limited, but much more lower energy chip is actually running and listening to the microphone and running a very, very small — somewhere on the order of 30 kilobytes — ML model to say, ""Hey, has somebody said that word, that wake word phase that I'm supposed to be listening out for?"" They have exactly the same challenges. You only have a few hundred kilobytes at most. You're running on a pretty low-end processor. You don't have an operating system, every byte counts. So you have to squeeze the library as small as possible. That's one of the real-world applications where we're actually using this TensorFlow Lite Micro. More generally, the Raspberry Pi is...you're probably looking at $25, something like that. The equivalent — which the Raspberry Pi Foundation just launched last year or maybe at the start of this year — that's kind of the equivalent of the Arduino, is the Pico. And that's, I think $3 retail. The Raspberry Pi, again, uses one or two watts of power so if you're going to run it for a day, you essentially need the phone battery that it will run down over the course of a day. Whereas the Pico is only using a hundred milliwatts, a 10th of a watt. You can run it for 10 times longer on the same battery, you can run it on a much smaller battery. These embedded devices tend to be used where there's power constraints, or there's cost constraints, or even where there's form factor constraints, because this thing is even smaller than a Raspberry Pi Zero and you can stick it anywhere and it will survive being run over and all of those sorts of things. Lukas: Can you describe — let's take, for example, a speech recognition system — can you describe the differences of how you would think about training and deploying if it was going to the cloud or a big desktop server versus a Raspberry Pi versus an Arduino? Pete: Yeah. The theme again is size and how much space you actually have on these systems. You'll be thinking always about, ""How can I make this model as small as possible?"" You're looking at making the model probably in the tens of kilobytes for doing...we have this example of doing speech recognition and I think it uses a 20 kilobyte model. 
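A back-of-the-envelope sketch of the battery arithmetic Pete describes above: runtime is just battery capacity divided by average power draw. The numbers below are illustrative assumptions, not figures from the episode.

```python
# Rough battery-life arithmetic for the Raspberry Pi vs. Pico comparison above.
# All numbers are illustrative assumptions.
battery_wh = 12.0      # roughly a phone-sized battery, in watt-hours
raspberry_pi_w = 1.0   # a Raspberry Pi drawing on the order of a watt
pico_w = 0.1           # a Pico-class microcontroller at about 100 milliwatts

print(f'Raspberry Pi: ~{battery_wh / raspberry_pi_w:.0f} hours')  # about a day
print(f'Pico: ~{battery_wh / pico_w:.0f} hours')                  # roughly 10x longer
```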
You are going to be sacrificing accuracy and a whole bunch of other stuff in order to get something that will actually fit on this really low energy device. But hopefully it's still accurate enough that it's useful. Lukas: Right. How do you do that? How do you reduce the size without compromising accuracy? Can you describe some of the techniques? Pete: I actually just blogged about a trick that I've seen used but realized I hadn't seen in the literature very much. Which is where — the classic going back to AlexNet approach — after you do a convolution in an image recognition network, you often have a pooling stage. That pooling stage would either do average pooling or max pooling. What that's doing is it's taking the output of the convolution, which is often the same size as the input but with a lot more channels, and then it's taking blocks of 2 by 2 values and it's saying, ""Hey, I'm going to only take the maximum of that 2 by 2 block. So, take 4 values and output 1 value,"" or do the same but do averaging. That helps with accuracy. But because you are outputting these very large outputs from the convolution, that means that you have to have a lot of RAM because you have to hold the input for the convolution and you also have to hold the output, which is the same size as the input, but typically has more channels, so the memory size is even larger. Instead of doing that, a common technique that I've seen in the industry is to use a stride of 2 on the convolution. Instead of having the sliding window just slide over 1 pixel every time as you're doing the convolutions, you actually have it move over 2 pixels, horizontally and vertically. That has the effect of outputting the same result as you would...or the same size, same number of elements that you would get if you did a convolution followed by a 2 by 2 pooling. But it means that you actually do less compute and you don't have to have nearly as much active memory kicking around. Lukas: Interesting. I had thought maybe with the size of the model it was just the size of the model's parameters, but it sounds like you also...obviously you need some active memory. But it's hard to imagine that even could be on the order of magnitude of the size of the model. Literally the pixels of the image and then the intermediate results can be bigger than the model? Pete: Yeah. That's the nice thing about convolution. You get to reuse the weights in a way that you really don't with fully connected layers. You can actually end up with convolution models where the activation memory takes up a substantial amount of space. I'm also getting into the weeds a bit, because the obvious answer to your question is also quantization. Taking these floating point models and just turning them into 8-bit, because that immediately slashes all of your memory sizes by 75%. Lukas: I've seen people go down to 4 bits or even 1 bit. Do you have thoughts on that? Pete: Yeah. There's been some really interesting work. A colleague of mine actually — again, I'll send a link to the paper — looked at...I think it's something like the Pareto-optimal bit depth for ResNet is 4 bits or something like that. There's been some really, really good research about going down to 4 bits or 2 bits or even going down to binary networks with 1 bit. The biggest challenge from our side is that CPUs aren't generally optimized for anything other than 8-bit arithmetic. Going down to these lower bit depths requires some advances in the hardware they're actually using. 
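A minimal Keras sketch of the stride-2 trick Pete describes above. This is not his code, just an illustration of why the strided convolution avoids holding the large pre-pooling activation in RAM; the input size and layer widths are assumptions.

```python
# Contrast a stride-1 convolution followed by 2x2 max pooling with a single
# stride-2 convolution. Both end up at the same output shape, but the strided
# version never materializes the full-resolution intermediate activation and
# does roughly a quarter of the multiply-accumulates.
import tensorflow as tf

inputs = tf.keras.Input(shape=(96, 96, 3))  # a small image, typical for TinyML

# Option 1: conv at stride 1, then pool. The 96x96x16 intermediate must sit in RAM.
pooled = tf.keras.layers.MaxPooling2D(2)(
    tf.keras.layers.Conv2D(16, 3, strides=1, padding='same')(inputs))

# Option 2: conv at stride 2. Same 48x48x16 output, much smaller peak activation memory.
strided = tf.keras.layers.Conv2D(16, 3, strides=2, padding='same')(inputs)

print(pooled.shape, strided.shape)  # both (None, 48, 48, 16)
```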
Lukas: Do you have any thoughts about actually training on the edge? I feel people have been talking about this for a long time, but I haven't seen real world examples where you can actually do some of the training and then it passes that upstream. Is that... Pete: What I've seen is that, especially on the embedded edge, it's very hard to get labeled data. Right now, there's been some great advances in unsupervised learning, but our workhorse approach to solving image and audio and accelerometer recognition problems is still around actually taking big labelled data sets and just running them through training. If you don't have some implicit labels on the data that you're gathering on the edge, which you almost never do, it's very hard to justify training. The one case where I actually have seen this look like it's pretty promising is for industrial monitoring. So when you've got a piece of machinery and you basically want to know if it's about to shake itself to bits because it's got a mechanical problem, and you have an accelerometer or microphone sensor sitting on this device. The hard part is telling whether it's actually about to shake itself to bits or whether that's just how it normally vibrates. One promising approach for this predictive maintenance is to actually spend the first 24 hours just assuming that everything is normal and learning, ""Okay, this is normality."" And then only after that, start to look for things that are outside of the...you're implicitly labeling the first 24 hours, ""Okay, this is normal data,"" and then you're looking for anything that's an excursion out beyond that. That makes sense for some kind of a training approach. But even there, I still actually push people to consider things like using embeddings and other approaches that don't require full backpropagation to do the training. For example, if you have an audio model that has to recognize a particular person saying a word, try and have that model produce an N-dimensional vector that's an embedding, and then have the person say the word 3 times, and then just use k-nearest neighbor approaches to tell if subsequent utterances are close in that embedding space. You've done something that looks like learning, from a user perspective, but you don't have to have all this machinery of variables and changing the neural network and you're just doing it as a post-processing action. Lukas: Do you see a lot of actual real world uses, like actual companies shipping stuff like models into microcontrollers? Pete: Yeah. This is hard to talk about because these aren't Android apps and things where people are fairly open and open source. A lot of these are pretty well-established old-school industrial companies and automotive companies and things like that. But we do see...there's a bunch of products out there that are already using ML under the hood. One of the examples I like to give is when I joined Google back in 2014, I met Raziel Alvares — who's now actually at Facebook doing some very similar stuff, I believe — but he was responsible for a lot of the OK-G work. They've been shipping on billions of phones, using ML and specifically using deep learning, to do this kind of recognition. But I had no idea that they were shipping these 30-kilobyte models to do ML, and they had been for years. From my understanding, from what I've seen of Apple and other companies, they've been using very similar approaches in the speech world for a long time. 
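A rough sketch of the embedding-plus-nearest-neighbor idea Pete describes above: enroll a keyword from three utterances, then classify new audio by distance in embedding space, with no on-device backpropagation. The embedding function, feature sizes, and threshold here are placeholders, not anything from the episode.

```python
import numpy as np

def embed(audio_features):
    # Placeholder for the on-device embedding model (in practice a small,
    # quantized network); here it just passes the feature vector through.
    return np.asarray(audio_features, dtype=np.float32)

# Enrollment: the user says the keyword three times; only the embeddings are kept.
rng = np.random.default_rng(0)
enrollment = [rng.random(16) for _ in range(3)]  # dummy audio feature vectors
enrolled = np.stack([embed(u) for u in enrollment])

def is_keyword(audio_features, threshold=0.5):
    # Nearest-neighbor check in embedding space: looks like learning to the user,
    # but it is only a post-processing step, no retraining of the network.
    distances = np.linalg.norm(enrolled - embed(audio_features), axis=1)
    return bool(distances.min() < threshold)

print(is_keyword(enrollment[0]))   # True: matches an enrolled utterance exactly
print(is_keyword(rng.random(16)))  # likely False: far from the enrolled examples
```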
But a lot of these areas don't have the same expectation that you'll publicize work, that we tend to in the modern ML world. It flies below the radar. These things are...there's ML models already running in your house, almost certainly right now, that are running on embedded hardware. Lukas: Besides the audio recognition, what might those ML models in my house be doing? Can you give me a little bit of flavor for that? Pete: Yeah. Accelerometer, recognition, trying to tell if somebody's doing a gesture, or if a piece of machinery is doing what you're expecting. The washing machine or the dishwasher or things like that, trying to actually take in these signals from noisy sensors and actually try and tell what's actually happening. Lukas: Do you think there's a ML model in my washing machine? Pete: I would not be at all surprised. Lukas: Wow. Pete: Yeah. Lukas: I guess another question that I had for you, thinking about your long tenure on TensorFlow — which is such a well known library — is how has that evolved over the time you've been there? Have things surprised you in the directions that it's taken? How do you even think about, with a project like that, what to prioritize into the future? Pete: Honestly how big TensorFlow got and how fast really blew me away. That was amazing to see. I'm used to working on these weird technical problems that I find interesting and following my curiosity. I'd been led to TensorFlow by pulling on a piece of yarn and ending up there. It was really nice to see...not just TensorFlow, but PyTorch, MXNet, all of these other frameworks, there's been this explosion in the number of people interested. Especially, there's been this explosion in the number of products that have been shipping. The number of use cases that people have found for these has been really mind blowing. I'm used to doing open source projects which get 10 stars or something, and I'm happy. But seeing TensorFlow and all these other frameworks just get this mass adoption has been...yeah. It definitely surprised me, and has been really nice to see. Lukas: What about in terms of what it does? How has that evolved? What new functionality gets added to a library like that? Why do you make so many breaking changes? Pete: Yes. I would just like to say I am sorry [laughs]. It's such a really interesting problem, because we're almost coming back to what we were talking about with Alex Krizhevsky. The classic example of the ML paradigm that we're in at the moment is you need a lot of flexibility to be able to experiment and create models and iterate new approaches, but all of the approaches need to run really, really, really, really fast because you're running millions of iterations, millions of data points through each run just in order to try out one model. So you've got this really challenging combination of you need all this flexibility, but you also need this cutting edge performance, and you're trying to squeeze out the absolute maximum amount of throughput you can out of the hardware that you have. So you end up with this world where you have Python calling into these chunks of these operators or these layers, where the actual operating layers themselves are highly, highly optimized, but you're expecting to be able to plug them into each other in very arbitrary ways and preserve that high performance. Especially with TensorFlow, you're also expecting to be able to do it across multiple accelerated targets. Things like the TPU, CPUs, and AMD, as well as NVIDIA GPUs. 
Honestly, it's just a really hard engineering problem. It's been a couple of years now since I've been on the mainline TensorFlow team, and it blew my mind how many dimensions and combinations and permutations and things they had to worry about in terms of getting this stuff just up and running and working well for people. It is tough as a user because you've got this space shuttle control panel of complexity and you probably only want to use part of it, but everybody wants a different- Lukas: Right, right. Well, maybe this is I guess a naive question, but when I look at the cuDNN library, it looks pretty close to the TensorFlow wrapper. Is that right? It seems like it tries to do the same building blocks that TensorFlow has. So I would think with NVIDIA, it would be a lot of just passing information down into cuDNN? Pete: Yeah. I mean, where I saw a lot of complexity was around things like the networking and the distribution and the very fast...making sure that you didn't end up getting bottlenecked on data transfer as you're shuffling stuff around. We've had to go in and mess around with JPEG encoding and try different libraries to figure out which one would be faster because that starts to become the bottleneck at some point when you're throwing your stuff onto the GPU fast enough. I have to admit though, I'm getting out of my...I've looked at that code in wonder. I have not tried to fix issues there, so I'm... Lukas: Amazing. I guess one more question on the topic. How do you test all these hardware environments? Do you have to set up the hardware somewhere to run all these things before you ship the library? Pete: Well, that's another pretty...the task of doing the continuous integration and the testing across all of these different pieces of hardware and all the different combinations of, ""Oh, have you got 2 cards in your machine? Have you got 4? Have you got this version of Linux? Are you running on Windows? Which versions of the drivers do you have? Which versions of the accelerators on cuDNN?"" All of these, there are farms full of these machines where we're trying to test all of these different combinations and permutations, or as many as we can, to try and actually make sure that stuff works. As you can imagine, it's not a straightforward task. Lukas: All right. Well, we're getting close to time, and we always end with two questions that I want to save time for. One question is what is an underrated topic in machine learning that you would like to investigate if you had some extra time? Pete: Datasets. The common theme that I've seen throughout all the time I've worked with...I've ended up working with hundreds of teams who are creating products using machine learning, and almost always what we find is that investing time in improving their datasets is a much better return on investment than trying to tweak their architectures or hyper-parameters or things like that. There are very few tools out there for actually doing useful things with datasets and improving datasets and understanding datasets and gathering datasets and data points, and cleaning up labels. I really think...I'm starting to see...I think Andrew Ng and some other people have been talking about data-centric approaches and I'm starting to see more focus on that. 
But I think that that's going to just continue, and it's going to be...I feel like as the ML world is maturing and more people are going through that experience of trying to put a product out and realizing, ""Oh my god, we need better data tools,"" there's going to be way more demand and way more focus on that. That is an extremely interesting area for me. Lukas: Well, you may have answered my last question, but I think you're well-qualified to answer it, having done a bunch of ML startups and then working on TensorFlow. When you think about deploying an ML model in the real world and getting it to work for a useful purpose, what do you see as the major bottlenecks? I guess datasets is one, I agree, is maybe the biggest one, but do you see others too? Pete: Yeah. So, another big problem is there's this artificial distinction between the people who create models, who often come from a research background, and the people who have to deploy them. What will often happen is that the model creation people will get as far as getting an eval that shows that their model is reaching a certain level of accuracy in their Python environment, and they'll say, ""Okay, I'm done. Here's the checkpoints for this model,"" which is great, and then just hand that over to the people who are going to deploy it on an Android application. The problem there is that there's all sorts of things like the actual data in the application itself may be quite different to the training data. You're almost certainly going to have to do some stuff to it like quantization or some kind of thing that involves re-training, in order to have something that's optimal for the device that you're actually shipping on. There's just a lot of really useful feedback that you can get from trying this out in a real device that someone can hold in their hand and use that you just don't get from the eval use case. So coming back to actually Pete Skomoroch, I first met him when he was part of the whole DJ Patil and the LinkedIn crew doing some of the really early data science stuff. They had this idea..,I think it was DJ who came up with the naming of data science and data scientists as somebody who would own the full stack of taking everything from doing the data analysis to coming up with models and things on it to actually deploying those on the website and then taking ownership of that whole end-to-end process. The teams I've seen been really successful at deploying ML products, they've had people who, formally or informally, have taken on that hot responsibility for the whole thing, and have the people who are writing the inner loops of the assembly sitting next to the people who are creating the models. The team who created MobileNet, Mobile-Vision, with Andrew Howard and Benoit Jacob, they were a great example of that. They all work very, very closely together doing everything from coming up with new model techniques to figuring out how they're actually going to run on real hardware at the really low level. So, that's one of the biggest things that I'm hoping to see change in the next few years as more people adopt that model. Lukas: Well said. Thanks so much, Pete. That was super fun. Pete: Yeah, thanks, Lukas. 
Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.",6989 +Chris Albon — ML Models and Infrastructure at Wikimedia,https://www.youtube.com/watch?v=l1flCSH_n9k,3383,2021-09-23,"Chris: When you have small teams the value of ML is you could start to really scale things out because you start to use machines as the assistant to you, right? So you train something manually and then you send it out in the world, and then it does that at scale for you, which is like a superpower. And so I just started going farther and farther down the path of saying, ""Hey, we can make this team of 4 people behave like a team of 50 people if we start to use ML more and more."" Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Chris Albon is the director of machine learning at the Wikimedia Foundation. Before that, he had a number of really interesting jobs, director of data science at Devoted Health, director of data science at Ushahidi, which is an open source non-profit that did mapping, and a project director at FrontlineSMS. He's also a well known educator on machine learning, the author of Machine Learning Flashcards, Machine Learning with Python Cookbook, and several fantastic machine learning tutorials. I'm super excited to talk to him today. Maybe we'll jump into...there's kind of a theme around, I guess, moderation and truth and security that I'm sure you think about a lot. One question we got from Twitter was basically, someone was wondering if Wikipedia has experimented with tools for moderators or kind of tools for educating disputes. I have to say, I've seen a lot of fighting in the comments at Wikipedia pages, and I'm kind of always impressed that they resolve, but are there special tools or algorithms that you all use? Chris: No, and I mean, that's really sort of foundational to Wikipedia, that it is...at the end of the day, it is a human project of humans deciding, like trying to get the truth. What is the perspective? What's that neutral perspective between the parties? I have to say that after joining the foundation, the new thing that I do is I care less about the Wikipedia page, I really care about what's called the talk page. Every page of Wikipedia has like a separate comment page, where people are constantly discussing over and over again, discussing, debating, finding new information, going back and forth. With things around disinformation, we've definitely been exploring areas of, for example, sock puppet protection — sock puppets, if you have like lots of accounts — and trying to build models that help predict that. Around dispute resolution...at the end of the day, if you see something on Wikipedia, we really want you to think, ""Okay, cool, at the end of the day a human has decided this, a human has made this kind of decision."" And so things like algorithmically making those decisions for people, that stuff is an anathema to everything of...which is why frankly, people love it so much, right? And which is why doing machine learning in that environment is so interesting, because you are trying to do things at scale with a human in the loop. You just have thousands and thousands and thousands of humans who are willing to help you out. Lukas: Totally. 
Another question along the same lines is, someone was asking, what are some of the most contentious Wikipedia articles and does your team ever get involved to resolve edit wars in any way? Chris: There's a lot of contentious pages across many languages. The interesting thing that I think people don't realize a lot is that I work for the Wikimedia Foundation. We are the non-profit organization that helps keep the infrastructure up. We fight legal battles for different Wikipedia communities, but each individual language of Wikipedia manages their own show, with their own rules based on a common set of norms across all Wikipedias, but it is their own show. English Wikipedia has an incredibly elaborate system of dispute resolution, different levels of user access between the admins. There is a full volunteer organization in English Wikipedia that is managing those kind of things, and that's the same with other languages. So for us at the foundation, it is very critical that we actually don't get involved and jump in because our role is sort of the folks who are one step back where like, we'll make the site better for you, we'll make your experience better, we'll recommend things that we think are interesting, we'll highlight, we'll help you work faster as an editor using ML, but we're not going to jump in and say, ""Hey, Steve was right and Jason was wrong in this particular article."" That is not our role. I think if we played that role, we would be the most hated organization very quickly. Lukas: Right. And I guess, someone else was asking is sort of the implications of ML top of mind? I would imagine it's hard to be really neutral with any kind of tool, right? Do you ever feel like there's implications for how your tooling works, even though you're really just supporting moderators? I would imagine that, for example, subtle changes in how search works might actually really change what content people are seeing, because there's such a high profile set of webpages. Chris: I think probably one of the key foundations of the team is the idea that any kind of ML that we do is not neutral, at the end of the day. Our gold standard, when we are making models, is that the model reflects the training data from the particular community that is served by that model. So for example, French Wikipedia wants a model that predicts the article quality, like if this article is really good or bad, to help editors decide which articles they should really jump in and help in. We want to get that data from the French Wikipedia community, train it, train that model, and then serve it back to the French Wikipedia community, and give that community the ability to actually manage and govern the use of that model in their system. What we're saying is like, ""Hey, there is no neutrality here"", but we will try to limit our ability to, say, train something on English Wikipedia and then apply it to Vietnamese Wikipedia, by gathering the training data from that original community and then serving back. It's not possible all the times because some models have to be like...you need to be global, scalable, there isn't like enough training data, and that kind of stuff. But that is our gold standard that we go for, and that we've done many, many times over the years. Lukas: Got it. One other question on this theme of moderation that somebody asked, and I'm kind of curious about, is what's the most common type of spam attack that you deal with, like adversarial problems that you come across on your different properties? 
Chris: Well, I mean, the most common one is someone like putting in like ""poop"" or swear words randomly into articles. Detecting that through...the community has actually done a great job, because I think people don't realize that the English Wikipedia community and other Wikipedia communities actually have developed their own machine learning models, like as bots that they deploy by themselves with no need from the foundation. We host them, but it is theirs to do whatever they want with. But the most common one is definitely adding swear words. It's something...as you can imagine. The ones that are the most dangerous are definitely the ones where the attackers have a lot of resources. One of the things that you quickly realize when you work here is that all of our models are open source. Everything we do is open source. You can see the whole thing, you can see my internal chat. You can see my ticketing system, like my Jira is totally public. What I'm working on a given day is public. I'm live streaming the work that I'm doing every single other week or something like that. All this is open and every single article on adversarial...not adversarial machine learning, but adversarial attacks on machine learning systems, it's like, if you have the model or you could actually use the prediction really, really quickly, you can start to figure out how to game the system because you have such exposure. We are exposing ourselves to that all the time, by showing them exactly what's happening with the model, by giving them the training data. And that is that sort of give and take that you sit with, okay, how do we work to see how other people are behaving in the system in order to detect any kind of problems while also making it that like...all of our models, you can hit an API for free and just use. Use as much as you want. As long as you don't crash the system, you're good to go. People are using it tons of times, they can download the model, they can download training data, they can run it locally, they can do whatever they want. But of course, there's risk in that, right? Because there's no...people can see your entire hand. It's like playing poker where you're showing your whole hand and they're not showing any of their hand. You're definitely at a disadvantage, but it is a trust-based activity that people who spend hours and hours and hours making changes to the site, writing new articles, finding some new interesting fact and then hunting down where to put that in, or sitting on those talk pages and debating and discussing how to exactly phrase a single sentence about some article because it's really important to get that right. That only works if they can come and see my team and say, ""Hey, they're doing everything...I can see what they're doing. I understand what they're doing. I understand where they're coming from and I can participate in that."" That's the only way we have anything, because the worst case scenario would be that people thought that what we were doing was a black box that you just couldn't see and there was some mystery behind what it was. And we were just like, ""Oh, just trust us, just trust us."" Don't trust us, come and look, come and see, run the code yourself, tell us we're wrong. We're Wikipedia, so we'll definitely invite changes all the time. Lukas: What does it feel like working with that level of transparency? 
I can see how it really must keep you honest around security holes and thinking really carefully around not doing security through obscurity, but like, what's the experience like? I mean, I assume that in your previous roles you didn't live stream your work as you were doing it. Chris: Yeah, it's interesting, because I've worked in the non-profit space and the startup space for a long time, and in both of those spaces that I've traditionally worked, even when we were doing open source work, it was sort of off in a corner. Like there weren't really that many people who were paying attention to it, or if you're a startup, it's literally all IP. And so you're like, deep in the bowels of the organization in the back working on some algorithm that people hope will help them raise money or something like that, but no one's going to see it, you're never going to publish it, there's never going to be a paper about it, it's just your secret sauce in the rear. At Wikipedia, because we do everything so open, I have learned to lean in on the idea of being open with a large amount of humility. Just to give a real example, we are going to start releasing model cards. So, an individual page that describes every single model that we host, and we've been sort of making prototypes and experimenting with them. The experiments are public, you can take a look at the experiment page, and you'll sort of like see what's happening. But some of the models are going to look embarrassing, like you're going to look and be like, ""Wow, that's a really bad model, I can't believe you put that in production."" And we just need to like...that is the only way to go in this scenario, is to just say, ""Hey, we are going to be open, we're not going to take offense if you say our model is crappy, come help us fix it."" We will lean into all the humility that we can because that is the only way to do this. The only way to do this is just to come in with a huge heaping pile of humility and openness and just let things go. It is weird, and it is different, because when you work on the team, you work on this nexus between machine learning, which a lot of people are interested in, and Wikipedia, which a lot of people are interested in. It's like working under a spotlight, in a sense. I do live streams of myself working and the first few weeks like a hundred people were showing up, and they would just watch me not know something, like not understand how something's working, not understand how my system's working. Another example is there was this bug report that I randomly saw that showed a huge percentage of the traffic of one of our data centers was because of one image, like one image was all the data. It was some flower or something like that. And I was just like, ""Oh, that's kind of cool, I'll tweet about it."" I tweeted it and then within 24 hours there were like a hundred articles about this flower that was causing all these problems on Wikipedia. People kept on coming to the Phabricator tickets, like the Jira ticket that the engineers were working on to fix it, and it was crashing Phabricator because of so much traffic. They were sending in messages of support and all these comments and ideas of what they thought it was, and the engineers were like, ""Just stop, just stop, we think we got it, stop posting comments."" You're in the open, you're in the open, you're in the public, and you cannot be defensive with how you do it. 
Because I mean, if you're really defensive about it, it's probably not a great job, so probably won't be that enjoyable. Lukas: Do you think you've had to develop a thicker skin? I know whenever I do anything that's very public, mostly the feedback is positive, but I really feel the negative feedback much more. And I think it causes...any kind of public thing we do, a little tinge of stress. I kind of can't imagine if everything was like public and visible, and people were watching it. Has it changed your mindset at all, or the way you work? Chris: Oh yeah. I think when I started, I think I had a regular thickness of skin. I was like, ""I'll do fine in this role. What would you do possibly?"" And then you see what happens, right? People don't like what you're working on. People don't think the foundation should exist. People don't think there should be machine learning in it. People think your model is wrong or dumb or stupid, or why would you do it, or like there's this particular problem, or why aren't you working on this other thing or 10,000 things. And remember, everything we do is public, so someone can post a comment about a ticket from like 2014 and say, ""Oh, this is stupid,"" or whatever. And people can like take your code and say...you get that all the time. It is something that I think everyone on the team just learns to be okay with. I think the best people who do it are the people who just come in just lean into the idea, like ""Hey, it's okay."" People like what we're doing, for the most part. Some people won't, that's okay, but there'll be people who just won't like it and there's nothing to do about that, right? There's no other way to operate. But yeah, there's definitely times where you're like, ""Oh, my God, this is brutal. This person really doesn't like me."" But, all of that pales in comparison to like the simple fact that... I get up every single morning and people pay me money to work on Wikipedia and all the other projects. That's what I do all day. I sit down and I'm like, ""This would be a cool thing to do here, we should work on this, let's change this up,"" like that's all I do. Just make Wikipedia, work on Wikidata, work on Wiki Commons, all the cool projects for all these people who've volunteered, volunteered thousands of hours to work on this stuff. My salary is paid by donations, so like people are donating 5, 10 dollars to make my salary, right? That is how I'm working on it. Once you put that more into perspective, you're able to take a lot of heat. Lukas: That makes sense. Do you find yourself getting distracted by the content? Chris: Oh, yeah. Lukas: I mean, Wikipedia, I find so fascinating. I would think if I was working directly on it... I actually remember my first job, I was writing a search engine, we were practicing on Wikipedia. And I remember, like every time I was editing the...or monitoring the search results, I'd just go down these rabbit holes on whatever topic it was pulling up. Chris: It is...it is genuinely hard. And not just the straight content, but all the layers underneath it because when you start to work on it, you realize all these little decisions that were made around like, ""Oh, how do we do licensing? Or like, what is the kind of ramifications of that?"" So for example, we have Wikimedia Commons, which is all the images that we have, and it's like, ""Oh, there's faces in the images? Why are faces allowed in these? 
What's the rabbit hole?"" And that's like been a huge multi-year discussion by these folks of what to do about that and if that's okay, and that kind of stuff. And then you just look at the talk pages and you look at the discourse, and there's just, there's so much. It is like the classic iceberg diagram. There is so much that's all public, but just not that front page. There'll be a page of like Dalmatian puppies, and then there's like just a huge, massive discussion of licenses and behind the scenes of how to do certain things. I have definitely become very distractible, because...research comes out about really interesting ideas and I'm sort of constantly being pinged by like, oh, there's this cool thing about how to auto translate stuff, or this cool idea of how do we detect these particular stuff, like maybe we should work on this and like try to keep the team sort of focused on pursuing just a few things to move forward, is hard enough. But I definitely do like that I can have Wikipedia open on my browser window forever and it's technically working, even though I'm randomly scrolling Prussian military history or some super duper, duper random topic. And- Lukas: Do you have a favorite Wikipedia page or topic that I could look at after this interview? Chris: I do, I do. Lukas: Tell me. Chris: It is called perpetual stew. Perpetual stew is the idea of a bowl of stew that is never stopped cooking. So it is cooked forever. And the idea is you're constantly adding to the pot and as you're taking out from the pot. This sounds like a crazy concept when you think about it that you just have this like, a hundred year old stew that you're doing. Lukas: It seems a little disgusting, is it good? Chris: See, this is why ... And then the photo is amazing because it has like a whole fish in the photo, which is like someone's thrown a whole fish. It's weird. It's weird, but that is not something that I would ever imagine, but yeah, it's a cool idea. But there's another- Lukas: That was a great answer that you just had instantly. We didn't- Chris: I spend all day looking at Wikipedia. Literally all my conversations about Wikipedia, like all the time. The images of it, the different parts of it...So yeah, I definitely have a long list of ones that I think are great. I think some of the ones that I have really appreciated have been the ones that are in the news. I don't think I really appreciated how much work the volunteers do when something is like fast-moving news. I remember during the US presidential election, and I was going to the page...there's all these procedures in place. The volunteers all on their own, they lock down the page through this process, only these kind of edits go through. How do you make changes? How do we do the wording of this kind of stuff? All that, that happens in the moment, live, which is just so cool to watch. Now whenever there's some kind of event, I immediately go to the relevant Wikipedia page and go to the talk page and watch people hash it out to like figure out how to work, which is just so cool. Lukas: That's awesome. I mean, I think one of the things I was excited to talk to you about was actually the ML infrastructure at Wikipedia, because a lot of the real world people we talk to, you have to be a little bit cagey or vague about exactly what the problems are with the infrastructures, but you're so open about this stuff that I think we can really get into the nitty gritty. Chris: Yeah. We are all open. 
Lukas: Before diving in, and this is actually a question that somebody asked, but I think it's a really good one to start with is, what are the important ML applications at Wikipedia? You mentioned some of them and you said some of them aren't even run by your team, but just off the top of your head, what are the things going on using ML? Chris: We do a lot of models that help editors, that's probably our main body of work. This would be things like, for example, predict if a particular edit is...we think it's a productive edit or not, or whether we think it's a damaging edit or not. The idea is not to make changes to Wikipedia ourselves, but to flag it for editors in the UI, literally the UI changes that they can say, ""Oh, okay, cool, like I should go deal with this because this edit is probably bad. So I can skip this particular edit, and I can go to this other edit,"" sort of prioritize work. We also are working on some things that we call structured tasks. The idea is that there are many ways to participate in Wikipedia, and one of the hardest barriers is that you try to get your first edit in and it's like instantly rejected, because it fails some long established rule about how things should go. And so, one thing we've been doing with structured tasks is like, can we use ML to recommend edits that we think will pass? Sort of like an easy mode. And they might be something simple, like grammar, or they might be the one that we're working on right now is a link. So like, is this word a link to another article? Should that be true? And so we'll highlight the word and then highlight where we think it should be pointing to and then ask them, is this right or not? And then if they say yeah, it becomes an edit that gets pushed to ""Production."" Our big focus is to make that editor and reader experience better using ML. There's other things that we do, like we predict the topic of the article, and we look at sock puppet stuff, but the big one is trying to make editors' experience better. Lukas: Do you build separate models for every language, or is this kind of all baked together as a single model? Chris: We traditionally do one model per language. Right now I'm looking at a kind of shift, where we end up doing one model per language for every single model that we can, but then doing a language-agnostic model for everything else. You could imagine that the 300 languages that we would support, there would be a language-agnostic model that would work for all of them, but not as good as a language specific model of where we can. Gathering the training data from each individual community is really time consuming. You can't do that 300 times with a really small team. Trying to do that balance where we can do that global coverage, but believe that the gold standard should be an individual language-based model. It doesn't happen for everything. For example, when recommending whether a link is... or recommending whether a word is a link or not for that link recommender I just described, we don't need to have a language-specific model for that, we can take advantage of that. But I know one of the questions that someone asked on Twitter was like, what am I interested in NLP, and language-agnostic models is the thing that I'm really interested in. 
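The pattern Chris describes, a language-specific model wherever a community has supplied training data with a language-agnostic model as the fallback, can be pictured as a small routing table. The registry and model names below are hypothetical, not Wikimedia's actual identifiers.

```python
# Hypothetical registry: per-language models where community training data
# exists, plus a language-agnostic fallback for the long tail of languages.
PER_LANGUAGE = {
    ("article-quality", "fr"): "articlequality-frwiki",
    ("article-quality", "en"): "articlequality-enwiki",
}
LANGUAGE_AGNOSTIC = {
    "article-quality": "articlequality-agnostic",
    "link-recommendation": "addlink-agnostic",
}

def pick_model(task: str, lang: str) -> str:
    """Prefer a community-specific model; otherwise fall back to the agnostic one."""
    return PER_LANGUAGE.get((task, lang), LANGUAGE_AGNOSTIC[task])

print(pick_model("article-quality", "fr"))       # language-specific model
print(pick_model("article-quality", "vi"))       # falls back to the agnostic model
print(pick_model("link-recommendation", "de"))   # agnostic by design
```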
Because when you start to do one model per language, you run into a scalability problem pretty quick, like how do you maintain with fresh training data, with monitoring, with all that stuff of like a huge breadth of languages, well beyond the languages spoke on the team? How do you maintain that? Having some kind of idea of like, okay, cool, let's do like a mix where we'll have like some models that are just across all languages, but our gold standard whenever we could, is to make like one model per individual language. What we want...the community governs the Wikimedia Foundation, like they're the ones who select members to the board. And then like the board decides what the priorities of the organization are and that trickles down to me. For us, we want communities to feel that they have the power to decide what they want to do with the model. So like, if French Wikipedia is like, hey, we want a model that predicts the edit quality, great, we'll help them get training data and put that model out. If they then decide that they don't want that model anymore, we'll turn it off, right? Because the goal is that we're here to support them and their stuff. They're the ones who are putting up the huge amount of hours and effort and time unpaid to make this stuff, we're just trying to make their lives a little bit better. Lukas: I would imagine that you have probably more requests than you can really field, how do you prioritize all the requests that come in for different models, and also improving existing models? Chris: Yeah. A lot of times what is really hard is distinguishing different types of requests. One of the things that happens a lot is that volunteers have really spiky participation, this is just sort of natural, right? They do a lot of work on something, and then they get a new job, and so they kind of disappear for six months, and then they come and do a lot of participation again, right? That's exactly how volunteering works. Because you're volunteering, you have other things. School starts, you have a new kid, you decide that you're bored of doing it, you take on another hobby, and that kind of stuff. That kind of really spiky participation means that...when I took over the team, we talked about it a lot and we decided that what we wanted to do is that if we ended up hosting anything on the foundation servers, that we will own it. If someone comes in, and really works with us and helps us build a model and that kind of stuff, and then they go off and do something else, we will continue to maintain that model in perpetuity, and keep on running with it. That means that you have to be selective of what you take, because you can't take every single thing that people are asking for if you're going to own everything that comes in. And so there is a process of deliberating what that would be and whatnot. There's other ways that people can host models at the foundation...this is a technical podcast, people are probably familiar with AWS and EC2, we run our own EC2 instance, essentially, which is what you call cloud services, where people can actually go and host their own stuff. If they wanted to host their own things on our servers, that's totally fine and they could do it through there. 
But when it comes to my team, we know that we need to own something, because part of our idea of what it would look like to do community-based, public, ethical ML is ownership of us saying like, ""Hey, we screwed up, that this model is bad, we screwed up that this model is harmful."" And the only way we can do that is if we actually own the model, we understand how it works, and that kind of stuff. Evaluating models that get submitted or requests for models and that kind of stuff is a real challenge, which is unique to the foundation in a way. Lukas: How many models are you owning, like running at any given time? Chris: We have, I think, 120 models right now. And maybe five that are currently being built. We stopped building new models for quite a while over the last year, because we're switching infrastructures for model deployment, which we could talk about. Lukas: Yeah, let's talk about it. Chris: There was definitely this moment where we were like...the current infrastructure, which has lasted us a really long time and is sort of what got ML at the Wikimedia Foundation off the ground, is not serving us anymore. We need to go back and figure out what to do. Because of the nuances of the foundation, the foundation is a strong believer of privacy and of open source, which means we don't use cloud-hosted services. We are not on AWS, except for like very, very small things. We're not on Google Cloud compute. We are on our own servers in our own data center, or not our own data center, but in our own racks in the data center. Building out a new model deployment system was literally starting off with like, what are the specs of the servers that you want, like how many sticks of RAM? Just to show you the level, I had conversations about how the racking was going to go. We bought a GPU to try to test if we could use it in our server and I got this photo from the person in the data center, like the Wikimedia Foundation employee in the data center. In the photo, he's trying to install the GPU into the server blade and he can't, it doesn't fit and he's like showing me in the photo that it doesn't fit. Like that's the level of like bare metal up. Which, as a technical challenge is really fun. I've taken a lot of appreciation that the foundation actually cares so much about privacy, that it is unwilling to give up anything. It is very, very thick. It is funny because there's a ton of SREs at the foundation, like most of the tech stuff is by SREs because you constantly need to have these people maintaining the systems and building the system for that low level. But, yeah. Lukas: What are these models? A lot of the questions that we got were actually like, is Wikimedia using deep learning? I guess, I should just ask that. But I actually want to be more specific, can you describe like what frameworks are you building these models in? What are they like? Chris: Yeah. So right now we have a lot of models in scikit-learn. That was sort of the initial set of models, these are the ones that are predicting article quality, and the quality of an edit or like the topic of that kind of stuff. We've started to move towards more deep learning-based models, particularly around like computer vision and NLP, because there's just big advantages to using that. Right as I joined the foundation, they were setting up some GPUs in...because we have to use our own stack. So, literally installing the GPUs in the machines and starting to work on there. 
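As a rough illustration of the scikit-learn models Chris mentions, for example scoring whether an edit looks damaging so it can be flagged for human review rather than acted on automatically, here is a minimal sketch. The features, labels, and threshold are entirely made up; it only shows the shape of the approach.

```python
# Sketch of an edit-quality classifier: hand-engineered revision features in,
# a "probably damaging" score out, surfaced to editors in the UI. Toy data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features per edit: chars added, chars removed, profanity count,
# whether the editor is anonymous. Labels would come from community review.
X = np.array([[120, 4, 0, 0], [3, 0, 2, 1], [45, 10, 0, 0], [1, 0, 3, 1]] * 50)
y = np.array([0, 1, 0, 1] * 50)  # 1 = damaging

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier().fit(X_train, y_train)

scores = clf.predict_proba(X_test)[:, 1]
flagged = scores > 0.9  # edits a reviewer should look at first
print(f"{flagged.sum()} of {len(scores)} held-out edits flagged for review")
```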
As we move forward, I know we're using fastText for some model, which is that Facebook library. For me, as the person who's herding the cats in this instance, I have become very interested in simple models, because the goal of what we do at the foundation is accessibility. You should be able to understand what we're doing. Not every single person, and it's okay if not everyone outside ML understands what we're doing, but my goal is that if you see a model that we're using, here's the foundation's model for detecting whether or not this piece of text is a link or something, that you can go to an open source page on GitLab, you can see the code that's plainly documented, you can see the link to the data that we used to train it, you understand what's happening, because it's not so insanely complex that it's impossible to access. And then you can fix it, you can make it better, you can throw in improvements, that's what I want. I want people to see what we're doing. And so I am less interested in the most technical solution, I'm definitely more interested in the sort of practical, like what is the sort of lowest common bar that does it. That said, there's some things, frankly, particularly with NLP, that I feel are just really complex. We were just talking this morning about using BERT to basically replace some of the models that we're currently running on scikit-learn, where throwing BERT in there should make them better. There is value, there is value in complexity, but it goes back to the idea that I don't want people to think that we have a secret sauce. I want people to think that we're a set of, hopefully, somewhat humble people building out in the open, and you can come and help us and participate and challenge us and ask those questions. And so the more accessible the tools we use, the better. If we end up using a proprietary system to make it...I mean, that would never happen, but the reason that would never happen is you'd never be able to trust us that it was true, it'd just work or not work and you'd have to believe it. We want you to go and dig into it. We are moving into deep learning. I actually have a big ask for GPUs...it is really hard to buy GPUs, in case anybody has ever been in that world. It's super hard to do that. We're out there hunting around for GPUs that fit into our servers at this moment. Lukas: You mentioned an infrastructure change, can you talk about what prompted that, like what was happening and what infrastructure you moved to? Chris: Our system, as it's run since the beginning, was on what's called ORES, which is our homegrown model management system. Before there was Kubeflow, MLflow, or before MLOps was a thing, there were people at the foundation who were building, essentially, those functionalities from scratch. It is 18 servers split across two data centers. One in Virginia, one set in Texas. There were issues around...one of the things it does is, it is for deploying a very specific type of model, particularly edit quality ones and that kind of stuff. And it's really paired with the training system. So the training system and the deployment system are very, very, very interconnected. Which means that you couldn't add, say, a deep learning model in there, because it wasn't part of the training system, which is also a homegrown system. The big one for me, as sort of the director of the project, was that, because it doesn't use serverless infrastructure, there is a hard memory requirement. 
So if your model is...I think the machines have 128 gigabytes of memory, each of them. And if your model is two gigabytes, you now only have 126 gigabytes of memory left. No matter how much that model is used, it could be used every single second, it could be used once a month, it is taking up a finite resource, which is very problematic for us, because so many, as we were talking about, so many people come to us and are interested in deploying a model or interested in sort of how we do things, which means that we need to...in order to participate with those people at a real level, we need to not so much care if something is really used or not, right? If someone comes in and they say, ""Hey, I have a great idea for this project,"" and we work on it with them and we create a model and then we deploy it, we need to be fine with it being dormant for months. Maybe it's only used once a year, or maybe it's used all the time, and that's okay. When you reach that finite level of...literally you're running out of RAM and every single time you need to...it's a zero sum game where you're using more and more of the physical RAM to hold the models in memory? It got too far. What I think happened is that this was sort of a pioneer in the space of MLOps. And now what has happened is there's so many great projects out there that are doing MLOps, that there's such a value to switching over. We've moved to setting up what we call Lift Wing, which is a Kubeflow instance on a new Kubernetes cluster that we did. And Kubeflow is an open source project for MLOps on Kubernetes. There's so many great advantages of that that we've been taking in. For example, the custom libraries. So we had a researcher, he used fastText and didn't tell us because we just hadn't made that communication. And it was fine, right? He gave us the signal, we'd never seen fastText before, but hey, we'll build the server for it. We'll build the service for it, and it'll run. It means you could run deep learning models or TensorFlow or PyTorch or whatever you want to do in that system. Everything's all nicely Dockerized. We've been Dockerizing the models that we have on ORES, and then just moving the Dockerfile over to the new system. There's way more storage and analytics around how things are working. We want to pair it with a full training suite. Right now we're focused on model deployment, but we want to get to the point where we're doing nightly re-trainings. That would mean that we could do things like shadow models, so a prediction comes in, we serve it to two versions of the model, compare the stats of how each is doing, sort of an A/B test, except only one of them, I guess the A, actually serves its prediction back to the user. But just a huge amount of taking advantage of that modern infrastructure. And it wasn't...when this was started at the foundation, there just wasn't this infrastructure. And now there is, and so taking a step back and building that out has been really fun. I will completely admit that it is somewhat terrifying to start at a job, look around, and say, ""Hey, I think we need to build the infrastructure from scratch,"" which becomes like a planning document, which becomes a budget line, which becomes server specs, which becomes a server box deployed to a data center like to plug in, which becomes hiring SREs, which becomes slowly configuring the system, which becomes running through a thousand problems. I mean, right now, where are we right now in that system? 
Two days ago we got our ""Hello, World!"" that we served a prediction using the system, which was so cool to see after all that work. That's really the fun part about the foundation is that you're doing something out in the open and you're doing something, frankly, from a technical experience, from bare metal. Like from bare metal all the way up, that's how you're figuring it out. Sometimes you hate your life for it, because you're like, you know what's easy? AWS, AWS is easy. Look at all these wonderful services which they provide people. But at the same time, having the control to own the system from scratch and know that people's privacy is protected, that we have control over everything, where any of the data goes, any of that kind of stuff, which means that people can participate in the projects feeling safe, that they're not going to be exposed because they edited an LGBTQ article or something like that, we have that ability, which is so nice. And it feels so good to have that. But it is going to be a long process of us getting from...we're going to build a second cluster, which we're going to be using mostly for training. In our architecture, we're trying to split up one Kubeflow instance for model serving and keep that with really good uptime and keep that really, really simple. And then we're having a second one, which has access to the data center, which is more like, if it goes down for a day, that's fine, right? And so we could be a little bit more experimental, we can push a little bit farther, we can give more people access to the system, they can come in and break it without any kind of interruption to service and then move the models between the two, as needed. Lukas: What's the piece that Kubeflow is doing for you, it's the swapping in and out of the models? Is that the key thing that is happening? Chris: The big part is the resource management. I think that's always been the real value in that, for us, our model usage is really spiky. There's sort of always a hum, a certain amount of noise of people using the models. And then there'll be someone who wants to know a prediction of every single article on Swahili Wikipedia, and so you get this huge spike. We try very, very hard to not limit people. When we're limiting people's API access, it's because you're going to break the system if you do more. That is really our goal. We're funded by people, so people should be able to use it. Along those lines, being able to maintain that really spiky structure, particularly with models for a broad range of systems. So like, maybe one requires a GPU that uses TensorFlow, maybe two don't require one that uses scikit-learn, and then sort of managing all those resources in an automated fashion is super powerful for us, because it means that we can not have what we have currently with ORES, which doesn't do that so well. Around a year ago, I had my kid on my lap and I was manually restarting the prod server using a script I'd written after a glass of wine to try to get the server back up. Try to get around those kind of issues with something that balances those resources really well. There's other things that we care about at the foundation. Because it's an open source project, the foundation believes in open source projects, we want to contribute back to them. I think there's lots of nice...on the training side, I'm pretty excited about some of the UI parts of it. 
For example, Jupyter Notebooks that could connect...would allow our researchers to actually connect to the database and actually construct models in the Jupyter Notebook and then push a button to put it in production. Those are some of the things I'm interested in down the road, but just straight resource management is a big deal. It's weird, because the thing about the foundation is the foundation is 500 people, which is like, wow, that's a lot of people, but you're running like one of the top 10 websites in the world. The scale is crazy. Trying to do that with what ends up being a small team, when you cut people...down to people who are working in the tech department, people who are working on ML, people who are working on this particular system. A very small number of people have a lot of responsibility for things and so automating what you can out to these systems is pretty nice. And leaning on open source projects, where other people can help you with issues, is definitely valuable. Lukas: Makes sense. There's another theme of questions I want to make sure that we cover here, just in our crowdsourcing of questions for you, which I'd sort of summarize as...I think people admire the career that you've had and working on really impactful stuff in machine learning. How did you get into machine learning and how have you thought about your career? How do you feel like you've managed to get to all these super interesting projects? Chris: My formal training is in quantitative research, actually quantitative social science research. I went to a PhD program that was all about stats, basically. When I was graduating, I knew some people who were working at a Kenyan non-profit and I just joined them, and kind of was working on that. And then from there, you sort of grow a community of people in a social network that you know and people keep on pulling you into other things to work on. I think for me, ML...where the appeal was...and I'm going to anger some statisticians on here. So this is hot take, hot take- Lukas: Nice, Gradient Dissent. Chris: Yeah. The thing that frustrated me about statistics is I tended to not care about the causal inference about a lot of things. I cared about the results that were happening because I was doing a lot of this stuff, as a job, in impact. So I was doing like election monitoring, like helping someone set up an SMS-based electoral monitor. I didn't so much care about the causal relationship between whether or not someone would send a message in or not, I cared if they did or not, right? I really, really focused on outcomes. When you have small teams, the value of ML is you can start to like really scale things out because you start to use machines, it's like the assistant to you, right? So you train something manually, and then you send it out in the world. And then it does that at scale for you, which is like a superpower. I just started going farther and farther down the path of saying, hey, like we can make this team of 4 people behave like a team of 50 people, if we start to use ML more and more, and keep walking down that and just get more and more complex. As I started doing things of more scale, you sort of move from the modeling side to the engineering side of like, okay, now we have 200 models. How do we make sure every single model is running at all times, that it's totally okay? And how do we do that at scale, and like constantly moving to the next more technical challenge in that range. For me, I feel like I have stumbled into this stuff. 
But really, it was probably because when I got started, I knew some people who were working at this teeny little tech non-profit in Kenya, and just got to know them. Then they were sort of like, oh, what about this other place? So then I switched places. And then like, hey, what about this other thing, and I joined that. You just sort of go from one thing to another, to another, to another. It's true that some of the people that I worked with 10 years ago on various projects around...environmental projects and that kind of stuff, work at Wikipedia, right? Like work at Wikimedia, like they're still here. There's this like group of people who are working on stuff. It doesn't mean that other people don't come in, and it doesn't mean that it's not a job. It is a job that I go to everyday and I do my job, but you start to see the same faces over and over again as you do this for a while and people invite you to come and apply for a role or that kind of stuff. Lukas: How does it relate to your well-known ML flashcards and tutorials. What prompted you to do that? Do you think it is similar to your focus on outcomes and applications versus the underlying statistics? Chris: Yeah, no, completely. I think and people will...so I make these flashcards. They're hand drawn, they're all about ML concepts. People have come to me over the years and been like, hey, you should really read more books about ML rather than flashcards. And I was like, well, one, I have read a lot of books about ML at this point. But the point of the flashcards and the point has always been one single thing, that ML interviews require a certain amount of rote memorization. There are people that try to throw you gotcha questions and I have received those questions, like describe a random forest from scratch and that kind of stuff. Those questions, it's just easier to just memorize them, right? To just sit down and memorize it. Interviews shouldn't be run that way, I totally understand that, we should all get to a better place where that's not happening. But yet it does happen, in most job interviews. For me, I just started making flashcards for it, like what is this concept? What is this concept? What is this concept, right? Like, can I do it, and just looking at the flashcards over and over again. From there, I just sort of developed more and more of them. And then other people got interested in that kind of stuff. But it is about getting that stuff into your brain. It's not something that you can read. If you read a thousand books, maybe you probably forget the concepts because there'd just be so many. Instead, these are the concepts that I've run into in interviews and other people have run into in the interviews, and like memorize it, memorize it and then regurgitate it back up, because you look really cool when you write an equation from scratch or something like that, because you had it in your brain. It goes back to the idea that I am interested in impact. I'm 1,000% interested in impact. A game is being played in an interview where they try to stump you with like describe gradient descent to me. That's a game, they're trying to throw a trick question at you. Crush it. Memorize the concept and then crush it in the interview. That's it. I wish people didn't throw those kind of questions, but they do. And so great, I will make flashcards to get past that part. It's less an issue now because I do more management stuff, so the questions are not so deep. But it definitely was a big part of my career. 
Especially for me, because people look at me with a social science background, like literally my PhD is in political science. People were like, oh, so you're a terrible coder and a non-technical person, so I'm going to throw you some gotchas in there. And just being able to memorize it and spit it out has been, frankly, a really useful tool. Lukas: When you interview a technical person, now that you're a manager, how do you approach that? How do you avoid gotcha questions? What questions do you ask to get at the competence of somebody's work? Chris: I actually really prefer to give people a choice of what they talk about. Some of the questions that I've really liked have been like, ""What algorithm can you actually describe in detail? Whatever you want, what's that one that you like? What's that one that you have, this is your go-to?"" I like that, because I'm not trying to say, ""Hey, in my experience this algorithm is important and therefore, if you don't know that particular one, you're not qualified."" It's instead saying, ""Hey, I want you to go deep, but you can pick anything that you get to go deep in and let's just jam out about it."" I have really, really, really appreciated that, because I have had candidates who come in who have been pretty nervous, and it can come off that they don't know what they're talking about, or something like that, and I'll throw that question to them, and they will just destroy it. They will just go so incredibly deep, and they'll start to geek out on it and they'll start to enjoy it, the whole interview process, because they get to talk about what they know and they light up about it. It is so fun to participate in that, and it shows you that people have this variety of expertise, because they did this particular ML model for four years, and they really, really, really know it. So you can say like, ""Okay, that's cool. It'd be fun to work with that person."" That's the kind of stuff that I have grown to like, because the fundamental truth about data science is that it's such a broad field. The questions that you get in an individual interview can be all over the place, from deep statistics, like I've had to write a statistical proof at one point, to model production stuff, like MLE, MLOps kind of things, like how would you architect a system to do this, to computer science stuff, to social science stuff, just all over the place. Frankly, I'm amazed that anybody passes these interviews. I've liked giving people the opportunity to dive into wherever they want. If they can't find the place that they really dive into, that's also a signal, right? Lukas: That makes sense. Well, we're almost out of time and we have two questions that we like to end with, that I think you'll have interesting responses. The second to last question is, what's an underrated aspect of machine learning that you think people should pay more attention to? Chris: Oh, wow. That's an interesting approach. I think the one that I really have started to like a lot is low-power models, so models that don't require...there's one direction that ML is taking, which is bigger and bigger and bigger and bigger models. It's sort of like getting a bigger and bigger, bigger truck, right? You just like, oh, what would be better? Two engines. You know what's better than two engines? 6 engines. You know what's better? 24 engines. I have really started to like, TinyML, like very, very, very small ML that you can run on a Raspberry Pi, and that kind of stuff. 
I think there's a pureness around it, but there's also like, creativity comes from constraints. Constraining yourself to very, very low resource settings is really interesting. I think it opens up stuff around cheaper smartphones and that kind of stuff, which...it's just a different direction than you're going to get from some of the really cool but huge models that take $24 million to train or something like that. Lukas: Totally, yeah. And even a Raspberry Pi is kind of big, I mean, try an Arduino. The final question is, what's the biggest challenge of making these models actually run in the real world? I mean, you're actually responsible for running models, what's the biggest challenge? Chris: The biggest one I think that I face is...well, I'll take a step back. When me and you were getting into ML, because we're both slightly older, I don't want to claim that your old, but you're around my age, ML was just starting off. You could totally join an organization and make any model and run it on your laptop, and it was better than the hard-coded thing that you were using and you were amazing, right? That's no longer the case. Now it's the case that they've had 10 years worth of models that they've made, all in these different settings, all in these different contexts. They're retraining models every single night, and so they have like thousands or 10s of thousands of models to deal with. A big part of what I found is hard is, how do you just manage all those models? And this is a real pitch for MLOps. It is hard to manage just all those models all the time and make sure they're all...not broken, not old data, they throw errors, there's dependency management around it. It is difficult to have, in the real world setting, hundreds of models going out all the time. Whether you're at a company or whether you're at the Wikimedia Foundation, it is just hard to do that. It is not a surprise to me that ML Ops has become the thing that is really, really helping people in this field out because it is something that is otherwise just difficult. It's insurmountable to think to do it yourself, because...it's easy when you have one model, right? You can be like, oh, let me think about this particular hyperparameter deeply after reading a book. It's another where it's like, we're going to be training 6000 models tonight. How do you keep them organized? How do you keep them up? How do you see how they're being used? How do you maintain them? That is a different game, which is where we're going for sure. Lukas: Awesome. Thanks so much. Great note to end on. Chris: Yeah. Lukas: Appreciate it, Chris. If you're enjoying these interviews, and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.",9916 +Emily M. Bender — Language Models and Linguistics,https://www.youtube.com/watch?v=VaxNN3YRhBA,4385,2021-09-09,"Emily: It's really important to distinguish between the word as a sequence of characters as opposed to words in the sense of a pairing of form and meaning. Because what the language model is seeing is only the sequence of characters. It's a bit easier to imagine what that's like if you think about a language you don't speak. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. 
Today, I'm talking to Emily Bender, who is a Professor of Linguistics at the University of Washington, who has a really wide range of interests in linguistics and NLP, from societal issues to multilingual variation to essentially philosophy of linguistics. I'm especially excited to talk to her because she was actually my teacher for Linguistics 1 at Stanford University, where I was an undergrad. It was one of my favorite classes. I still remember it. I still remember a whole bunch of interesting facts that I learned. And it led to this lifelong interest in linguistics that I've really enjoyed. So, could not be more excited to have a conversation with her. I thought it might make sense to start with the paper that you coauthored, ""On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? ""¹, which was notable even to me on Twitter for a lot of controversy at Google, which I was hoping you could maybe start by describing, but then get into the meat of what the paper actually says. Emily: Yeah. So, it's not in the IPA and hard to pronounce, but the title actually includes an emoji. The last character of the title is a parrot emoji. We were doing that just kind of for fun, because we liked the stochastic parrots metaphor, and there was a while before all this happened that we thought the thing about this paper would be it was the one with an emoji in the title. Little did we know. The paper came about because of work that Dr. Timnit Gebru and Dr. Margaret Mitchell and their team were doing at Google, really trying to connect with the engineering teams to build in good practices to make the technology work better for more people and do less harm in the world. So, that was sort of the role that they had there. They noticed, especially Dr. Gebru, that there was this big push towards bigger and bigger language models. The paper has this table that just...the number of parameters and the size of the training data just explodes over the past couple of years. Dr. Gebru actually direct messaged me on Twitter saying, ""Hey, do you know of any papers that talk about the possible downsides to this, any risks? Or have you written anything?"" And I wrote back and I said, ""No, I don't know of any such papers and I haven't written one. But off the top of my head, here's five or six things that we can be worried about."" About a day later, I said, ""You know what, that feels like a paper outline. So, here's a paper outline, you want to write this?"" That was early September, and the conference we decided to target was FAccT, the Fairness Accountability and Transparency Conference, which took place finally in March 2021. Submission deadline was October, I think, 8th of 2020. So, in a month, we put together this paper. That was possible because it actually wasn't just the two of us writing it or the four named authors finally, but in fact we had seven authors. So, Dr. Gebru brought in Dr. Mitchell — it's really important to me to emphasize that they have doctorates, but I also know them well enough that I'm going to start full naming them now, or first naming them actually — so, Timnit brought in Meg and three other members of their team, and I brought in my PhD student, Angelina McMillan-Major. Between the seven of us, we sort of had enough different areas of expertise and literatures that we've read that we could pull together this survey paper. And so, it came together. 
It was amazing, and also an interesting writing experience because we never had a Zoom meeting or anything where all of us spoke together. It was all done through remote collaboration in Overleaf. So, not a super common way for a research to get done, but it worked in this case. The Google authors put it through what they call pub approve over there, got approved. We submitted it to the conference and then put it away, because none of us had actually anticipated working on that in the month of September. It was like extra work for everybody. so we all turned back to the other stuff we needed to be doing. In late November out of nowhere, from my perspective — and I should say that in telling the story, I'm not at Google, I've not been funded by Google, and so I only have sort of secondhand understanding of what went on at Google plus what was out in the press eventually — but the Google coauthors were told to either retract their paper or take their names off of it, and they weren't told why. They weren't offered a chance to discuss what might need to be changed about the paper, it was just ""retract it or take your names off of it"". We had this strange moment of, ""Okay, what do we do with this paper?"", because it seems kind of odd to put something out with just two authors that actually represents the work of seven people, what do we want to do here? My PhD student Angie and I, we turned to the Google coauthors and we said, ""We will follow your lead here. What do you want to have happen?"" And they said, ""No, we want this out in the world. So, you two publish it."" That was the initial answer. And then Timnit, sort of on reflection, said, ""Actually, this is not okay. This is not an okay way to treat a researcher who was hired to do this research."" This was literally her job and the job of everyone on that team. And so, she pushed back. The result of all that, you can go find in all the media coverage, is that she got fired. Google claimed she resigned. Her team says she got resignated, which is a great neologism. That went down fast enough that she was able to then put her name on the paper. Meanwhile, Meg started working on documenting what had happened to Timnit. The end result of that was that she was fired a few months later, but after the final version of the paper was done. So, that's why the fourth author is Shmargaret Shmitchell. That's a really sad story for everybody involved. I mean, it's terrible mistreatment of Timnit and Meg and the other members of the team, those who are on our paper and those who weren't. It's become, I think, a really difficult environment to work in. It's sad for Google because they lost really wonderful expertise and a lot of goodwill in the research community. And sort of sad for...it sheds a light on the sad state of affairs about the way corporate interests are influencing what's happening in research in our field right now. On the other hand, my coauthors and I still maintain...we all really enjoyed the experience of working on this paper together and of weathering the stuff backwards together. One weird result is that this paper has gotten way more attention than it ordinarily would have. I think it's a good paper, it's a solid paper. And boy, did we put a lot of polish on it between the submission version and the camera-ready because we knew it was going to be read by a lot of people. When I put up the camera-ready as a preprint...I didn't put it on arXiv, because those tend to get cited instead of the final published versions. 
So, I just put it on my website and tweeted out a link with a bitly link to shorten it, so that I could see how many times it was downloaded. It has been downloaded through that link alone over 10,000 times. I know that there's other ways to get to it, which is way out of scale to anything that I've ever written otherwise. So, that's been interesting as a researcher, but it's also I think fortunate because it has come to the attention of the public, and I think that this technology is widespread. It's being used. It's being used in lots of different ways. It's really valuable that the public at large has a chance to understand what's going on. And so, through Google's gross misstep, I and my coauthors have been given the chance to help educate the public, which is something that I do feel fortunate about. Lukas: I'd love to kind of get into what the paper talks about. But do you have any sense, or has Google made any comments, about what their objection was? Because I sort of had this feeling that it must be a really incendiary paper, and then in the prep for this interview I actually read it and it felt pretty uncontroversial, I guess, was my feeling reading it. So, I just wonder...I mean, maybe it's hard to know, but have they said anything about what they didn't like about it? Emily: Yes, I mean, in public comments, there's been things like, ""It doesn't cite relevant work that is trying to mitigate some of these issues."" But at no point were we ever told which work we should have been citing. And we do, in fact, cite some work that is trying to mitigate these issues. So, I don't know quite what that was about. But you're absolutely right. We figured that we'd be ruffling some feathers with this paper because we were basically saying, ""Hey, this thing that everyone's having so much fun chasing, maybe let's go a little bit slower and think about what kinds of downsides there are and how to do this safely."" There's going to be some who don't want to hear that, but we honestly thought it was going to be OpenAI who was upset, because GPT-3 is kind of the best known example of this, and it was our running example, too. So, we thought we'd ruffle some feathers, did not realize we were going to be ruffling feathers inside Google. And, it's basically a survey paper. We didn't run any experiments. We didn't do any analysis. What we did was we pulled together a bunch of different relevant perspectives on large language models and brought them all together in one place. It is surprising that the paper seems to have been part of the cause of Google basically blowing up this amazing asset that it had in terms of its ethical AI team. Lukas: Interesting. And I guess, one reading of your paper is, ""Hey we should consider the downsides of large language models."" I think maybe another person might read it...this might be an unfair reading, but I could imagine someone having hurt feelings if they were working on large language models, and they read your paper saying it's like an unethical thing to do, to build large language models. Would that be an overstatement of your claims? I don't have the paper in front of me, but I think maybe that could hurt feelings, I'm not sure. Emily: I also do a lot of work in the space...societal impact of NLP in general, and that sometimes goes under the title of ethics and NLP. I do see a lot of people reacting to that topic with hurt feelings, and I think it's connected with the way in which people identify with their work. 
If you say, ""Hey, let's think about this technology we're building and how it behaves in the world and what we can do to make it be beneficial,"" and you use the term ethics to describe that, sometimes people want to read that as, ""You're calling me unethical,"" and I think that that direction of the conversation is rarely actually valuable. I do think that, in general, people in this space want to be doing good things in the world. Certainly, there are people who are working on technology with the goal of making a lot of money doing it. There's this caricature of the tycoon or whoever, who's just happy to crush all the little people to make as much money as possible, that's out there, probably. I think much more frequently, people are working within systems that give them certain commitments around maximizing value for shareholders and stuff like that, that make it harder to put on the brakes on some things that are making money right now for shareholders and take a bigger picture view. But it is much more valuable to talk about it in terms of what are those systems, what are the incentives, what can we as individuals do within those systems, rather than think about people as ethical or unethical. I'm not sure that really speaks to your question, but hopefully it's somewhat helpful. Lukas: No, I mean, I think you're saying that your point is a little more nuanced than what maybe someone would take away, and I can...I run a company and I love technology and I love building...I do recognize that lots of people get hurt, and I think it's great that people are pointing out issues and also kind of pumping the brakes and flagging this stuff. But I could kind of see how someone might feel a little offended by it. I wasn't sure if I was kind of jumping to something or- I guess my question, well, the question that I kept thinking about with the whole paper in general as I was reading it is, even sort of setting aside making money, let's just talk about research and just the excitement of building models that work. I just feel that so deeply, like GPT-3 for all its fuss, it's kind of amazing what it does. I wouldn't have expected it to work so well. Would you feel...would you argue that those kinds of directions of research should stop? Or what would you want an organization like OpenAI to do differently? Ethics is a good example of a place that's kind of actually really showed that bigger models do kind of... it's not obvious that bigger models would perform tasks better at many extra orders of magnitude. Would you prefer that that research doesn't happen or happen differently somehow? Emily: So, I think it's worth saying that OpenAI has actually put a lot of effort into thinking about ""What are the possible downsides?"" and ""What could happen when this technology is released in the world?"", and that's important to note, and I'm glad that they're doing that. I think that what I would like to see more of is...first of all, that kind of work. What are the possible failure modes and how do they impact people? And then also, when this is working as intended, how can that impact people? OpenAI has been doing some of that and I think that's great, and they should do more. 
But also, you can look to other fields of engineering, where before you take something and you put it into the world in a place where people are going to rely on it, there's all kinds of testing that has to be done in sort of understanding of ""What are the tolerances?"" and ""What works and what doesn't?"" and ""What's the range of temperatures that this thing could be applicable in?"" and ""What are the things you have to check for and certify?"" and things like that. We don't have very much of that yet going on in NLP. I can speak less to other areas of AI, but I honestly think there's similar issues elsewhere in AI. And so, there's work — actually that was done at Google by Meg Mitchell and Timnit Gebru and others on a framework called Model Cards, which was sort of steps in that direction of like, ""You've built a model, what does somebody who's going to use this model need to know about it?"" — and that's the kind of thing that I would like to see more of. And that is in contrast to just rampant AI hype, where people build something, it's cool, it's fun, it works well, and somehow, that's not enough. People have to say...it's not enough that GPT-3 can produce coherent text, people have to say it's understanding language, which it absolutely isn't, as I'm sure we'll talk about later. Lukas: You have two good segues, but yeah, yeah. Emily: Yeah. Although it is all connected, right? Lukas: Yeah. Emily: So, for some reason, the culture around AI is all about trying to reach for these big claims rather than trying to build really well-scoped, reliable — sufficiently documented that they can be used safely and reliably — systems. That's the direction that I would like to see more of, is one thing. And then another thing, and we get into this in the paper, is that if the main pathway to success these days is just bigger and bigger and bigger, then you cut out lots of languages communities, even within the languages that generally are well supported, because they just can't amass that much data. And you also cut out smaller research groups, smaller companies that are not sitting on the kind of collections of data that Google is, or Facebook is, or Amazon is. Microsoft also does a bunch of big data work, they don't seem to have amassed data quite the same way as the other big ones. That is unfortunate because it, I think, stifles creativity to a certain extent. If the whole community is rushing towards this one goal that only some can really effectively do, then we lose out on the other things that people might be trying instead. Lukas: Maybe a less obvious concern that you talked about in the paper is talking about how the models can encode bias in ways that are hard to notice. When you talk about the harms that might happen from natural language models, do you have examples of things that are actually happening now? Or is this more of like a future-looking thing that we're worried about as NLP becomes more pervasive, like worrying about future harms? Emily: So, I mean, absolutely happening now, and therefore easy to predict that it will keep happening in the future if we don't change. Here, the work of Safiya Noble, with her book ""Algorithms of Oppression"", is a really important documentation of this. She looked into what are the ways in which identities — which properly belong to the groups of people who have those identities — are represented and reflected back to people in search. In particular, her running example is the phrase ""black girls"" and also ""black women"". 
These things have changed over time — and she's very careful to document when she talks about particular examples what the date was — but early on, as she started this project, the phrase ""black girls"" as a search keyword basically turned up pornography. And that you might say is, ""Well, that's just in the data."" Well, what data? Where did that data come from? If you get into the heart of her book, it's basically around that that's ""in the data"" because of the way in which the economy of the internet allows people to purchase and make money off of identity terms. Once these things were flagged, Google sort of piecemeal [made] changes, so you don't get pornography as the results for the search term ""black girls"" anymore. But it's also possible to sort of poke at things and tell that it's very much individual after-the-fact changes, as opposed to anyone going through and systematically thinking about how to redesign the way that search engines and the advertising-driven ranking of search latches on to these incentives and then amplifies them. One ongoing discussion in the AI community — you see it pop up on Twitter with great regularity — is: is the problem only that the data is biased, or do the models also contribute? And the answer is absolutely models also contribute. And then, there's another layer to it of, ""Well, that's just what's in the data."" One of the other really embarrassing examples for Google was, there was a point at which Google Image search turned up pictures of gorillas when you were searching for black people, and I forget exactly the particular configuration of that, but embarrassing and awful and racist. One reaction at the time was, ""Well, that's just in the underlying data."" And so, ""Not our fault. We're just showing what the world is saying."", except that it's not true, because the way the algorithms that do the ranking of search results and also the bidding for the ad words is...that is emphasizing particular incentives. So, there is a certain thing in the underlying data. There's also the question of how did you collect that data? Where did it come from? What does it actually represent? It is not the world as it is. It is some particular collection of data. And then what is the optimization metric? What are all these modeling decisions that you've made? And how does that interact with the various biases in the data? And what's the incentive structure? Safiya Noble's work is a great place to look. Latanya Sweeney documented — this is a 2013 paper — how if you put in, at that point, an African-American sounding name, one of the ads that would pop up suggested that that person had a criminal history. And if you put in a white sounding name, you tended to get just more information about so-and-so. And that does real harm in the world — it wasn't 100%, but it was significantly different between the two groups of names — it does real harm in the world because if you imagine someone is applying for a job or just making friends and someone does a Google search on them, and here comes alongside this message suggesting they might be a criminal, that does harm. And then if I can give one more example— Lukas: Please, yeah, these are great. Emily: Elia Robyn Speer did a really interesting worked example around sentiment analysis and word embeddings. Sentiment analysis is the task of taking some natural language text, and her example is English, and using it to calculate or predict the sentiment. 
Is this a text expressing positive feeling towards something, negative feeling towards something, or not expressing feelings? The particular data set she was working with, I think, was Yelp restaurant reviews. So there, it's ""Take the text, predict the stars."" Lukas: Yeah, I've used that data set. Yeah, for sure. Emily: And then as an external component, she's using word embeddings, which are representations of words into a vector space based on what other words they co-occur with. So, some of the training data is in-domain, the Yelp reviews, but then there's this component that's trained on general web garbage. What she found using the sort of generic word embeddings was that the system systematically underpredicted the star ratings for Mexican restaurants. All right, so she digs into it and looks into why. It turns out that because that general web garbage included the discourse about immigration into the US from and through Mexico, which has lots of really negative toxic opinions of Mexican people, the word embeddings picked up the word ""Mexican"" as akin to other negative sentiment words. And so, if in your review of the restaurant you called it a ""Mexican restaurant"", according to the system, you have said something negative about it, so you can't possibly be giving it a five-star review. Lukas: Well, that's a really interesting example. My next question was going to be how do models play into this? I guess that's a good example of how not just the underlying data can have bias, but the model can literally have its own bias. Emily: Yeah, so the word embedding picked up on co-occurrences between the word ""Mexican"" and lots of other things that also co-occurred with negative sentiment, and then that was used as a component in this other model. So, yeah, there wasn't in the underlying Yelp reviews any particular reason that the Mexican restaurants were rated lower, right? Lukas: Right. Emily: I don't know for sure if they were rated on average exactly the same, but it doesn't matter, because the error was the system underpredicting for any given restaurant. On average, it was missing in the low direction. So, yeah, that's a kind of bias that was picked up from an external data set. We tend in NLP to use word embeddings as really handy detailed representations of word ""meaning"", so word similarity, including semantic similarity. And if we don't pay attention to what meaning was picked up, what co-occurrence was picked up, then we can end up with stuff we really don't want in our systems. Lukas: What would you recommend doing about that? Because they are really useful, word embeddings. And I'm sure in this case, it seems pretty simple. It's hurting your performance. There's not even a model performance tradeoff here. So, what could you possibly do? Emily: There is a lot of work on so-called debiasing of word embeddings. If you look at Speer's work, she continues on to do some of that. And I think that part of it is, work with more curated datasets. The discourse around immigration from and through Mexico, even if you stick with only things like reputable news sources, you're still going to find that garbage. That alone is not going to solve it, but it can be better. It's not possible to come up with a fully bias-free dataset nor fully bias-free word embeddings, but you can do better. One step is to sort of say, ""Okay, how much better can we do with curated data? 
What about debiasing techniques for the biases that we're aware of?"" Part of the problems with debiasing techniques is that you have to know what you're looking for. And then on top of that, to think through failure modes. So, in a particular use case, when you're building some technology, who are the stakeholders? Who's going to be impacted by it? If someone's restaurant rating is underpredicted for some reason, what does that mean in an actual use context? And what should we be testing for to see if we have sufficiently debiased for our use case, for the stakeholders who are most likely to experience adverse impacts? Lukas: I guess, it does seem like it would be incredibly...I mean, it seems like it'd actually be impossible to find an unbiased dataset of human... Emily: Right. It doesn't exist. Lukas: I guess these are good segues into other papers that I want to talk about. So, maybe we should just in the interest of time, we should move on to the second paper² that we want to talk about to make sure we get to it, which is around... Let me see if I can summarize this. So, this is basically sort of saying that language modeling only on what you call ""form"" — which I think is just sort of like the words coming through, this is kind of of the GPT-3 types of models that just like look at these strings of words — can't have understanding, like true understanding. I just thought one thing that was interesting is that you said you wrote the paper to sort of end some kind of debate on Twitter that I was definitely not aware of. Actually, I think I'm kind of coming into something with maybe more context than I knew. So, maybe you can sort of summarize what the different possible positions are here and what you want to put to rest. Emily: So, I kept finding myself getting into arguments on Twitter with people who were claiming that language models were understanding things. And I was like, ""No, they're not. They can't possibly be."" It's important to pin down what we mean by language models. So, a language model is something like GPT-3 or BERT or otherwise, where its training data is a whole bunch of text, and the training task is predicting words in the text. So, some of the times it's done sequentially, sometimes it's done with a masked language model objective, where certain words are dropped out and the training objective is, ""Okay, well put those words back in and then do your model updating to..."", gradient descent, et cetera, et cetera. For me, as a linguist, I look at that and go, ""Hey, useful technology, interesting. Incredibly helpful in things like speech recognition and machine translation where an important subtask is, 'Okay, what's the likely string?'"" So, in a speech recognition setup, the acoustic model says, ""Here's a range of text strings that sound might have corresponded to,"" and then the language model comes in and says, ""Okay, yeah, but 'It's important to wreck a nice beach.' is a ridiculous thing to say, and 'It's important to recognize speech.' is a reasonable thing to say, so we're going to rank that one higher."" That's the kind of form-based tasks that they were initially meant for and good at. And then what's happened with the neural language modeling revolution in the past few years is that when you extract the word embeddings from a language model, you have a really finely fitted representation of word distribution, which is very useful, and some of them can even do...where you get the word embeddings are contextual. 
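To make the word-prediction objective described here concrete, the following is a minimal illustrative sketch in Python (the corpus, counts, and behaviour are invented for this note; it is a toy bigram model, not GPT-3 or BERT). It learns only which word forms tend to follow which, which is already enough to rank the recognize-speech string above the wreck-a-nice-beach string and to generate plausible-looking continuations, without any pairing of form and meaning.

import random
from collections import Counter, defaultdict

# Toy corpus, invented for this sketch.
corpus = ('it is important to recognize speech . '
          'it is important to listen carefully . '
          'she went to the beach to relax .').split()

# Count which word form follows which: pure distributional form, no meaning.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def score(sentence):
    # Product-of-counts proxy for how likely this string is under the toy model.
    words = sentence.split()
    total = 1
    for prev, nxt in zip(words, words[1:]):
        total *= following[prev][nxt]  # zero if this pair was never seen
    return total

# The model can rank strings without anything resembling understanding:
print(score('it is important to recognize speech .'))   # nonzero
print(score('it is important to wreck a nice beach .'))  # zero: unseen pairs

def generate(start='it', steps=6):
    # Sample a plausible-looking continuation from the counts alone.
    out = [start]
    for _ in range(steps):
        options = following[out[-1]]
        if not options:
            break
        words, weights = zip(*options.items())
        out.append(random.choices(words, weights=weights)[0])
    return ' '.join(out)

print(generate())

Scaling the same idea up to neural networks, subword tokens, and vastly more text changes how fluent the strings look, but not the fact that the training signal is form alone.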
So, the information about the word and what it's likely to co-occur with isn't about that word across all the texts, but about that word in its current context. Super useful, but not the same thing as understanding language. I kept getting into arguments with people who were not linguists who wanted to say, ""Yeah, it is."" So, Alexander Koller and I wrote this paper to just sort of say, ""Okay, look, here's the argument why not,"" with the hopes that that would put an end to it, and it didn't. People still want to come argue with me about this. The thing that is really hard to see — and sort of the value of linguistics in this place — is that when we use language, we use it...and I'm sorry, I'm going to pull out a philosopher on you here, but Heidegger has this notion of thrownness. So, you're in a state of thrownness when you are not aware of the tool you are using. If you think about typing on a keyboard, when it's going well the keyboard disappears. And then, you have a key that sticks and then all of a sudden, the keyboard is very ""there"" for you again. Well, language is the same way. When we are speaking a language that we are fluent in, it is not very visible to us until something makes us focus on it. And of course, linguistics is all about focusing on the language. So, linguists are used to doing that. When we talk about giving words to a language model, it's really important to distinguish between the word as a sequence of characters as opposed to word in the sense of a pairing of form and meaning. Because what the language model is seeing is only the sequence of characters. It's a bit easier to imagine what that's like if you think about a language you don't speak. So, what's a language you don't speak? Lukas: Mandarin. Emily: Mandarin. Okay. You don't speak Mandarin. I assume you also therefore don't read Mandarin. Lukas: Definitely don't. Emily: Maybe recognize a couple of the characters? Lukas: I mean, I read Japanese. So, there's some overlap. Emily: Okay, so let's go further away. Do you read Cherokee? Lukas: No, definitely not. Emily: Okay. So, Cherokee has got this wonderful syllabary, it's a writing system where the characters represent syllables. If someone showed you a whole bunch of Cherokee text, that experience of looking at it would be a better model for what the computer is doing than you looking at English text, because you can't help but get the meaning part when you're looking at it. Because English is a language you speak and read. Mandarin is kind of in between there because you would pick up a few of the hanzi that you recognize from Japanese kanji, and it wouldn't be quite the same. Lukas: I guess...I don't know, I don't want to argue with you. But I do want to, I guess, advocate for... I don't know, I mean, I have not thought deeply about this topic. What I have seen in my life is these language models working better and better than I could have imagined from the strategy that they employ and sort of seeming like they're getting more and more subtle detail. Of course, when I was a kid, I learned about the Turing test, which seems like a pretty good test of understanding on its face. I think the test is like, if you have a conversation with something and you can't tell if it's an automated system or a human, then we can say that it has intelligence, and it sort of seems to me like these language models are on the verge of passing the Turing test. 
What would it take for you to feel like some automated technique actually has understanding of what it's consuming? Emily: Yeah. So, I think the first thing I want to say about the Turing test is the reason it doesn't work...and I hate to disagree with a giant like Turing, because Turing's work was really important and foundational— Lukas: But it was 100 years ago, it's possible to miss something. Emily: 70? Lukas: 70, fair, fair. All right, 80? 70? Okay, 70. Sorry. Emily: As it turns out, people are too willing to make sense of language and too willing to sort of build the context behind something that would make something make sense. And so, we are not well positioned to actually be the testers in a Turing test. That's why that doesn't work. Language models, because they can come up with coherent-seeming text. These are probable sequences, given a little bit of noise and where you start, what would likely come next based on all that training data. Then it sort of comes out as something that we can make sense of, and then we are sort of easily fooled into thinking that it actually meant to communicate that. So, you're asking the question of ""What would show that a machine has understanding?"" I think part of it is, well, let's talk about actually interfacing with the world in some way. We certainly do have cases where machines in restricted domains for restricted ranges of things that they can do, do understand. So, when you ask your local corporate spy bot to do something for you and it does the thing, it has understood. Lukas: Wait, sorry, what's a local corporate spy bot? Sorry, could we make this a little more concrete? Emily: I'm making a snarky remark about the privacy implications of things like Siri and Alexa and Google Home. Lukas: Oh, I see, I see, gotcha. Emily: And Samsung Bixby is in the same space. Microsoft had Cortana. Right? Lukas: Right, right. Gotcha. Emily: So, when you ask those things to set a timer, or turn on the lights, or dial a phone number or whatever, and it works, then, yes, to a certain extent, it has understood. And it has understood because its training setup was looking at not just language but something external to the language that needed to map to that. And so, that's a kind of understanding. The question is — for somebody who was interested in doing that across some more general range of things — the question is, how do you set up tasks that require some kind of action in the world, so that it can't be done just by bulldozing it with a language model and say ""Well, this is a likely thing to come next."", right? Lukas: You got to describe your octopus thought experiment because that was very evocative. And I have some questions. Emily: Okay. So, the octopus thought experiment is about not just being able to understand but learning to understand. That's the difference between it and both the Turing test and Searle's thought experiment, where both of those basically say, ""Imagine someone has set up the whole system."" Then we could test for intelligence or we can...from a philosophical point of view, say it's still not understanding. So, the system exists and we are thinking about it or testing it. The octopus is this thing of saying, ""Okay, if we had something that we assume, we posit that it is hyperintelligent..."" and then that's part of why we picked the octopus. In fact, it was initially a dolphin, but we decided that octopuses are inherently more entertaining. 
Also, it was better because dolphin's environment is a bit closer to a human's environment. So, we wanted the octopus to be something that is posited to be super intelligent. And they are, I think, understood to be intelligent creatures...as smart as it needs to be. That's not the issue. We are assuming intelligence, but then we are only giving it access to the form of language. In our scenario, you have these two English-speaking humans who end up stranded on two nearby islands. They're otherwise uninhabited, but they've had previous inhabitants who set up an undersea telegraph cable. These two humans can communicate with each other. We left it offstage how they discovered the telegraph or that the other one's on the other end, whatever, just assume it exists. It's the thought experiment, you can do things like that. You know, assume a spherical cow, except we don't need spherical cows. So, telegraph cable and the humans are named A and B. They're basically using English as encoded in Morse code to talk to each other. This hyperintelligent deep sea octopus that we called O comes along and taps into that cable. The octopus can feel the pulses going through for Morse code. The question is, what could the octopus actually potentially learn here? Because this is a hyperintelligent octopus – it's got as much time as it wants, as much memory as it wants — it is able to very closely model the patterns of what's likely to come next. In our story, the octopus decides for some reason that it's lonely and it's going to cut the cable and pretend to be B while talking to A. On reflection, it's like, ""Poor B, just cut off from the world."" So, maybe the octopus is also talking to B pretending to be A, but we don't talk about that part. The question is, under what circumstances could the octopus continue to fool A that it's actually B? We say this is in a sense, a weak version of the Turing test. The way the Turing test was set up, A is given the task of deciding ""Am I talking to a human or not?"" And here, there's subterfuge. The octopus, its mere existence is unknown to A. If there's just sort of like chitchat pleasantries, those things you can just kind of follow a pattern and it's relatively inconsequential as long as what's coming out is internally coherent. And even if it's a little bit incoherent, well maybe B is just being silly. It doesn't matter so much. Okay, well, O could get away with that. But once you get more towards things where A actually really cares about communicating ideas to B and getting ideas back from B, it's going to get harder and harder for the octopus to maintain this semblance of good communication. We go through this example where A builds a coconut catapult and the octopus is able to send back sort of like, ""Very cool invention. Great job."" or something, even though A was asking for like, ""Well, what happened when you built it?"" But the octopus has no experience of things like coconuts or rope or stuff like that. So, it can't reason about those things in the world, or even know that A is actually talking about them. All it can do is come back with, ""Well, what's the likely form of a response in this context?"" To the extent that O gets away with that, it's because A is willing to make sense of those utterances. O has no meaning in this scenario. And then finally, we have a bear show up and start attacking A, and A says to O — or to B, actually — ""Help. I'm being attacked by a bear. 
All I have are these two sticks, what should I do?"" At that point, O is utterly useless, and so we say this is the point at which O would definitely fail the Turing test if A survived being eaten by the bear. But then we tried with GPT-2 like, what would it say? The answers were hilarious. The words are in the right topic area enough that it comes back with something funny and I encourage people to go look at the appendix to our paper where we put these, but it's never going to be helpful. And it's not actually expressing communicative intent. Lukas: Well, I have to say, walking into that paper without knowing the context, I really enjoyed it. For me, I especially enjoyed it because the sort of concreteness of the thought experiment that was like evocative but also makes you think, ""Huh, what do I think about that?"" What I kept thinking was...for me, I feel like I've learned about a lot of things that I haven't experienced, I was especially thinking about learning math, where there's all these abstract topics. I feel like in a way I learned about math, in some sense, through form almost. It's all in my head. I'm like learning things as visualizing them. It seems possible to learn to reason about things that you haven't seen or experienced just from a stream of words. I even remember actually grading blind student's papers. It was really interesting, how they walked through stuff in a math class, and it seemed like they were visualizing things even though they were blind from birth. So, I'm just wondering ... I guess, I'm not totally convinced that the octopus couldn't somehow figure out what a catapult does if they listen to all language. Emily: So, if the octopus had actually had a chance to learn English, then yes. But it didn't because it never got that initial grounding. And we absolutely learn things through language that are outside of what we've directly experienced. Conversely, if you as a sighted person wanted to understand what it was like to live as a blind person, you could listen to or read what a blind person has to say about that and learn about it. So, that's definitely something that we can do. But we can do it because we have acquired linguistic systems. When we use language to communicate, we absolutely tell each other ideas and things that are outside of even our own experiences, right? We invent things, and then transmit that to other people. But we do that based on this shared system that tells us, ""Okay, here's the range of possible forms. These are the well-formed words and sentences. These are the sounds that we use in this language. These are the way the words are built up, the sentences are built up. And these are the standing meanings that they map to."" And then we use those standing meanings to make guesses about communicative intent. The problem for the octopus isn't that it's not smart. We said it's hyperintelligent. It isn't that if it knew the language, couldn't understand those things. It's that its exposure to the language is not set up so that it can actually learn it as a linguistic system, all it can learn is distributional patterns. Lukas: I guess what prevents the octopus from learning language over time like a human probably would? Emily: Okay, so, it doesn't get to do...and in the paper, we go into human language acquisition. For first language acquisition, it's all about joint attention. 
When babies learn language, it starts from social connections to their caregivers and understanding that the caregivers are communicating something to them, and then mapping the words onto those communicative intents. The child language literature talks about the importance of joint attention, that kids learn words when their caregivers follow into their attention, and attend to the same things, and then provide those words. That experience, that mapping, the octopus doesn't get that. It's just getting the words going by. Lukas: Do you think there's some algorithm possibly that could exist, that could take a stream of words and understand them in that sense? Emily: Natural language understanding is a tremendously difficult problem because it relies not just on the linguistic system, but also on world knowledge and common sense, reasoning, all kinds of things. So, you can certainly use — I'm more certain than I actually am — but there's a big difference between saying, ""I'm going to build an algorithm that has understanding of linguistic structure, has understanding of linguistic meaning, has understanding of how those meanings map to a model of the world, and then use that to understand,"" versus ""I'm going to build a system that only gets linguistic form and assume that it will get to understanding in some way."" So, yes. You could go much, much further with algorithms that have more in their input, in their training input, than just form. That's going to be things like visual grounding. It's going to be things like the ability to possibly query people for answers. It might be knowledge bases. It might be other sensors in some sort of embodied... I'm not saying that natural language understanding is impossible and not something to work on. I'm saying that language modeling is not natural language understanding. Lukas: But just so I'm clear, just consuming language without kind of all this extra stuff, you're arguing that no algorithm could from just that really understand language? Emily: By language, I mean, form. Imagine that you are dropped into the Thai equivalent of the Library of Congress, and you have around you any book you could possibly want in Thai, but only in Thai. For some reason, this library doesn't have Thai-Chinese, Thai-French, Thai-English dictionaries. It's just Thai. Could you learn Thai? Lukas: I think so. I guess what's hard is that I have a language already. But I feel like I- Emily: So, what would you do? What would be your first step to learning Thai if you have just oodles and oodles of Thai books and that's it around you? Lukas: What would I start to do? I mean...I'm not sure. Do you think I couldn't learn Thai? Emily: So, I'm curious about what you ... So, you as a person, could you learn Thai? Sure. You could go take a Thai language class. Lukas: No, no, I mean from in this situation, just sort of dropped in. I mean people do learn... How did people learn hieroglyphics or something when there's no one around that still knows it? Do they need to find like a Rosetta Stone? Or can they- Emily: The Rosetta Stone is what unlocked the hieroglyphics. If you don't have something like that, then what you have to do is resort to hypotheses about distributions and say, ""What do we know about the world in which these texts were written? What do we know about how languages work?"" Can we say, ""Okay, well given frequency analyses and the length of the words, it seems like a language that's got separate function words instead of lots of morphology. 
So, that thing might be an article, that thing might be a copy of a verb,"" and you could do some analysis like that. It's not what language models are doing. To get from those sort of structural things into something about meaning, you have to make guesses about what's being described. You have to basically bring in some world knowledge and say, ""How well does this fit?"" When I asked you the question of what would you do, I was thinking, well, possible answers are, ""I would go find an illustrated encyclopedia that has pictures in it."" There's some visual grounding. Or I would go find a book from whose cover I could tell it was actually the Thai translation of Curious George. Lukas: These are great suggestions. Emily: Yes. But all of that is bringing in external things. And then once you have a foothold, you can build on it. That's an interesting way to go. But if you just have form, it's not going to give you that information. Lukas: Wow, interesting. Thank you, this was really interesting. I guess, my last question on this topic is, do you sort of predict that these language models will run into problems that we'll really experience and then we'll have to kind of change the approach? Or do you think that as our bar for applications of natural language goes up, they'll just sort of adapt and find ways to incorporate external information, kind of like finding the Curious George translation? Emily: I think that language models are going to remain useful. I mean, language models have been an important component of language technology since Shannon's work in the 1950s. This is longstanding. But I think that we are likely — it's so hard to predict the future — but my guess is that...or maybe what I would like to see is that we get to a more stringent sense of what works and what sort of an appropriate range of failure modes and what kind of fail safes we need. People are going to find that putting language models at the center of something where your application really requires you to have a commitment to the accountability for the words that are uttered is going to be a very fragile way to go. My guess is that when we get to that point, we're going to de-center the language models and have them be something that is selecting one possible output again or providing these word embeddings, but they are not a step towards general-purpose language understanding the way they are hyped to be. That's one set of problems. If you have to have accountability for the words that are uttered, you do not want a stochastic parrot. You want something that will speak for you in a reliable way, not just make up what sounds good. The other thing is if we take seriously these issues around bias and encoding and amplifying bias and training data, I think we're going to find that we want to work with algorithms that can make more of smaller datasets, so that we can be better about curating and documenting and updating those datasets so that they stay current with what's going on, rather than this path right now that relies on very large language models. So, those are my guesses. There's also the environmental angle. Well, actually, the ""energy uses"" angle is both environmental but also about technology, to a certain extent. I think there are more and more people — and there's Schwartz et al, Strobel et al, Henderson et al. 
— a bunch of work now saying, ""Hey, let's make sure we're also measuring the environmental impact as we do things, or the carbon footprint so that we can direct effort to doing things in a more and more efficient way."" There's that angle, but there's also many situations where you don't have the whole cloud available. If you want to do computing on a mobile device, you're not going to be able to have an absolutely enormous language model in there. There's pressure to find leaner solutions. I think that's a win-win, environmentally and then in terms of more flexibility with technology. Lukas: Totally, totally. And it's a good segue because you pointed out a bunch of this stuff in your papers about benchmarks, which I'd love to talk about a little bit, and maybe you could kind of summarize...maybe start [with] what are benchmarks, probably most people know, but then what are the possible pitfalls with them? Emily: Yeah. I should say this is a paper called ""AI and the Everything in the Whole Wide World Benchmark""³ that we presented at a workshop called Machine Learning Retrospectives at NeurIPS last year. It's joint work with Deb Raji, Alex Hanna, Emily Denton, and Amandalynne Paullada. Another collaboration where...in this case, we actually do have meetings where we talk to each other, but of those people, the only one I've met in person so far is Amandalynne, who's a PhD student in my department. Pandemic life, right? We got together because we were talking about the ways in which benchmarks are being misused in the AI hype machine and in AI research that is striving for generality and overclaiming what the benchmark shows. So, a benchmark is basically a standardized data set, typically with some gold standard labels. Although you could also have benchmarks for things where the labels are inherent, like language modeling. What word actually came next is the gold standard label. The idea is that you might have a standardized set of training data, or possibly not, and then you've got the standardized test data. People can test different systems against this. You have this chance of saying ""Which system is more effective in this training regime?"" or ""Given this training data against that test data?"" So, that's a benchmark. You asked me before if I could summarize the problems with benchmarks, and it's not so much benchmarks I have a problem with, but the way that they're used. I think this is an example of ""the map is not the territory"". People will tend to say, ""Oh, here's this benchmark about computer vision."" ImageNet is that. Or, ""Here's a benchmark about natural language understanding of English,"" and that's GLUE and SuperGLUE. People will say...I've actually seen this in like a PR thing that came out of Microsoft saying that computers understand English better than people now, because this one setup scored higher than some humans on the GLUE benchmark. That's just a wild overclaim, and it's a misuse of what the benchmark is for. So, what's the problem with the overclaims? Well, it kind of messes up the science. We're not doing science if we're not actually matching our conclusions to our experiments. We live in a world of AI hype, which means that people are more likely to buy into and set up solutions that don't function as advertised because they live in a world where people are being told that Microsoft has built a system that understands English better than humans do. 
Of course, you could also build an AI system that does whatever other implausible thing like, ""Guesses someone's political affiliation by the way they smile"" or something, which makes no sense. But we live in a world where there's all these claims, overclaims about AI, and that makes these other ones also sound more plausible than they should. So, those are the problems that I see. But, benchmarking is important. In the history of computational linguistics, there was a while where when you wrote a paper for the ACL, the Association for Computational Linguistics, you would say, ""Here's my system. Here's how I built it. Here's some sample inputs and outputs,"" done. Then the statistical machine learning wave came through and brought with it the methodology of shared task evaluation challenges, which is sort of a historical version of benchmarking, where NIST and other organizations would say, ""Okay, we want to work on speech recognition, and we want to actually get a sense of how these different systems compare to each other. So, we're going to run a shared task evaluation challenge where everyone gets the same training data, and we're going to have some held out test data that no one gets to see. At a certain point, all the competitors submit their systems and we see what happens."" That's an improvement in the science compared to what was going on before. But that is not the whole story. If you want to understand how the system is working, if you want to understand how to build the next system, you can't just test it on some standard thing. You also have to look at, ""Well, what kinds of errors does it make?"", and ""How do the different systems compare not just in their overall number, but in their failure modes and which inputs work for them and which ones don't?"" and on and on like that, as opposed to, ""Okay, I got the highest score. I'm done."" Lukas: Right, right. Well said. I don't have much to add there. Can you say a little more about like...I feel like this is a great paper in that you make these really concrete, sensible recommendations. You sort of suggest a few alternatives to benchmarks. Could you maybe run through those for anyone listening? Emily: Yes, absolutely. So, it's more of complements than alternatives to benchmarks. So, in addition to benchmarks, this can be used sort of as a sanity check or, ""Okay, did my system actually do better than a super naive baseline?"", or ""I want to compare some systems head-to-head, let's use this benchmark."" You might also use test suites, which are put together to sort of map out particular kinds of cases that you want to handle well, as opposed to just grabbing whatever happened to occur in your sample test data. You might do auditing, which is very much akin to test suites in saying...so this is like Joy Buolamwini and Timnit Gebru and Deb Raji's work on auditing face recognition data sets, where they sort of systematically created the set looking at two genders and a range of skin colors and sort of say, ""Okay, is this accuracy actually even across this set of people or no?"" And they found out no. So, that's a- Lukas: How's that different than a benchmark? That kind of sounds like a benchmark, doesn't it? Emily: So, it's not the way benchmarks are typically created. You could imagine someone creating a benchmark that is sort of systematically mapping out a space, but that's not the practice. 
The practice is, ""We are going to go grab some data from somewhere, and then hold out 10% of it to be the test and the other 90% is training, or 80% training, 10% dev,"" right? The way benchmarks are typically put together is, ""Let's just grab a sample of data and see how well this thing works,"" as opposed to ""Let's create a testing regime through test suites or through this auditing process that can allow us to find the contours of its failure mode."" Not ""How well does it work on average?"" but ""How well does it work for this case, and that case, and that case?"" There's also adversarial testing, which is...a few different things fall under adversarial testing. Sometimes people will create test sets by going and collecting all the examples that previous systems did poorly on to make a particularly hard test set, which is interesting in the sense that it can filter out the sort of freebies that are too easy, but also doesn't necessarily guide anything towards better performance for a particular use case. Because it's just sort of like, ""Well, we're selecting what was hard for the previous model,"" not ""What's particularly important to get right or what's particularly likely to be frequent in our use case,"" and so on. So, that's one kind of adversarial testing. Another one is what we did in the Build It, Break It shared test. This was Allyson Ettinger, Sudha Rao, Hal Daumé, and I in 2017, put together a shared task where we had system builders and then breaker teams. The breaker team's goal was to find minimal pairs, two examples that were minimally different to each other, but would work...for which the systems would work for one but not the other. That would be a way of sort of mapping out what causes system failure. So, you can look at that. You can look at error analysis. Take the test set from the benchmark or the dev set from then benchmark, and then go in and look and say, ""Okay, what are the kinds of problems that are showing up?"" A lot of systems that rely on language models tend to do really poorly with negation, which is one of these things that's very important to the meaning, but tends to be a short word or subword, and so it is easy to miss. You can imagine speech recognition or machine translation, if you missed one word out of 20, it matters a lot what that word is. If you replace ""a"" with ""the"", in many cases, that's not going to cause a lot of problems. But if you just skipped a ""not"" somewhere? Lukas: Yeah, that makes sense. Yeah. Emily: All of this is basically about looking at what it is we're trying to build, what it is we're testing on, how it fits into the motivating use cases, and then what works and what doesn't, and for what doesn't work, what are the implications? What happens in the real world if that failure happens? And also, what are the likely causes? What is tripping us up? All of that is what we would like to see, instead of the leaderboard-ism, which is everyone just trying to climb to the top of the pile-on, which doesn't feel like it's really... I mean, people talking about the speed of progress in AI love to talk about how quickly those leaderboard changes and how quickly the state-of-the-art, SOTA, gets higher and higher on these various benchmarks. I always think, ""Yeah, but so?"" What does that actually mean in terms of understanding the world better from a scientific point of view or building technology that works better not just in the average case, but also in the worst case? Lukas: Yeah, it's interesting. 
Well, I had a couple things came up for me reading that paper. When I started my career, I think I was just sort of on the tail end of ACL papers where it seemed like they would just cherry pick some examples where it worked or it didn't, and it just seemed ridiculous. I remember they had early benchmarks and people would have lower accuracy than just guessing the most common case or something, which you could argue that's better, and people did, but that just seemed a little ridiculous to me. I remember this anecdote from your class about...I think it was Noam Chomsky saying that, ""Oh, moms don't teach kids language,"" but actually they do, and it's just like no one bothered to check. So, it's kind of maddening, and I think I appreciated benchmarks from that. But then your recommendations are not only reasonable, I think in companies, a lot of it is standard best practice. I don't think you would just release a new model without trying it and getting a flavor for where it works and where it doesn't. You wouldn't just be like, ""Oh, we took 10% of the data and held it out, let's ship it."" It does seem like that's actually one case where you see it more in companies than in sort of academic literature, probably because it's easier to look at one number and be like, ""Hey, we beat it."" But clearly, that's flawed. So, anyway, I thought that was a great paper with really good suggestions that I think everyone should definitely follow. I also want to make sure we got to the last paper that we talked about, which is cool, because I just want to make sure people know. What is the Bender Rule?⁴ And why is it important? Emily: So, Bender Rule or the #BenderRule- Lukas: Is it #BenderRule? Emily: Yeah, well, it's both. Lukas: Say what it is first, and then I have some questions about best practice. Emily: Yeah. It is itself a best practice, which says that you should always state the name of the language you're working on, even if it's just English. This is a soapbox that I've been carrying around and periodically climbing up on since about 2009, where I saw a lot of that pre-neural statistical NLP work saying, basically, ""Look Ma, no linguistics,"" and claiming that systems were language-independent because there was no linguistic knowledge hard-coded. And these supposedly language-independent systems were mostly tested on English. You also see a lot of work people will publish a paper on machine reading or paper on sentiment analysis, and in fact, no, it's a paper on machine reading of English and sentiment analysis on English text. Flip side is if someone's working on Cherokee, or Thai, or Chinese, or Italian, then that work...it's harder to get it accepted to the research conferences because it is deemed language-specific, where work on English is somehow general. That's a big problem for the science, it's a big problem for getting to technology that actually works across languages. I've been sort of going around pestering people to actually test cross linguistically and to name the language they're working on. In 2019, like three or four people — and this is in that piece on The Gradient, I have their names listed — sort of referred to this practice as the Bender Rule. I didn't name that, but once it was named I ran with it. Part of it is it's kind of a face threatening question to ask. If someone's written something about machine reading and I walk up and I say, ""What language?"", it's a stupid question to ask because it's obviously English. So, it's face threatening to me. 
And it's also a little bit rude to them, to ask this question that implies they should have said it. I don't mind people blaming that on me. Part of the reason I ran with the hashtag is, if someone wants to go ask this question and they feel like it's a sort of a silly question to ask, they can pin it on me, and I'm happy to lend my name to that. Lukas: I see. Nice. I guess this is a hard question, but this is what kind of comes to mind for me, it's like, ""Wow, English is so specific and probably has all these kind of idiosyncrasies."" How do you think NLP might be different if it had started in Thai or Cherokee, or if English just happened to be...I mean, English must be unusual in all these ways, right? Are there characteristics of English that are unusual and the world could have gone a different way? Emily: Yeah, absolutely. Actually, in that paper, I list out a bunch of them. One thing is English is a spoken language, not a signed language. If we had started NLP with American Sign Language or another signed language, it would have been very different, right? Lukas: Clearly. Emily: Yeah. So, that's one big choice point. Another thing is that English has a very well-established and standardized writing system. Many of the world's languages don't have a writing system at all, and many of them that do don't have the degree of standardization that English does. Also, many languages will have a lot more code switching going on, on average, than English does. Lukas: What is code switching? Emily: Code switching is when you use multiple languages in the same conversation, sometimes even the same sentence. That happens a lot in communities where there's a lot of bilingualism or multilingualism. So, if you and I...well, you also speak Nihongo, right? When you studied kanji, what was your favorite way to benkyou them? I am not a fluent code switcher, so that was really awkward and stupid sounding, but to illustrate the point. Lukas: I remember actually when...yeah, I know, I have experienced that for sure. Emily: Certainly, English is involved in a lot of code switching. But there's also lots and lots of monolingual English data, and when you go into social media data for Indian languages, for example, enormous amounts of that are code switched with English. And so, there's a whole range of interesting technical challenges that come up there. We live in a world where the first digital setups accommodated lower ASCII most conveniently, and English all fits in lower ASCII. English has relatively fixed word order. We have a relatively low...relatively simple morphology. Any given word that shows up is only going to show up in a few different forms. Compare that to Turkish, where you can get, I think, millions of inflected forms of the same root, and so that changes the way you handle data sparsity and what data sparsity looks like. Our orthography is a mess. Someone was just asking on Twitter, ""How come we do grapheme to phoneme prediction but not phoneme to grapheme prediction?"" So, grapheme to phoneme is, ""Given a letter, what's the likely sound?"", and that's an important component of text-to-speech systems when you hit an out-of-vocabulary word. Phoneme to grapheme would be, ""Given a sound, what's the likely letter?"", and that's not a typical task. I wonder to what extent that's true because of English's opaque and chaotic writing system. Lukas: Right. Sounds like an impossible task. Emily: Yeah, exactly. 
But if you were to look at...Japanese, setting aside the kanji, if you just try to transcribe Japanese in kana, that's way more straightforward. Spanish also has a very transparent and consistent grapheme to phoneme mapping in both directions. So, down to things like that, the properties of a writing system for English. English likes to use white space between words and sentence-final punctuation. These are things that we sort of take as given, that it's easy to tokenize into sentences and words, that just aren't going to be true in other languages. So, I don't know. I couldn't tell you what NLP would look like. I can just sort of tell you sort of where the points of divergence might be. Lukas: No, those are fun. I mean, definitely. I mean, I don't know. Those differences are so interesting. Emily: Well, you voluntarily took a linguistics class, so I'm not surprised. Lukas: Well, I just feel like linguistics is so cool. I mean, as an outsider just because if you don't know it, then it's really eye opening to just...because you swim in it, to sort of see all these patterns that I never would have noticed. And I feel like especially...like phonetics is probably the most deep, where you're just like, ""Oh, my god, those two sounds are different?"" I would just never, never have noticed that. It's so easy to do the thought experiment and realize you're wrong, that it's just...I love that stuff I feel like most of my early work was in parsing Japanese in different ways. I do remember...I guess it didn't seem like that was an impediment to publishing, but it was surprising that there was so little work on it for how necessary of a task it would be to deal with it. In my first job, it was mostly processing Japanese language stuff, and it was striking how little research there was defined on the topic. I felt like there was a sort of more institutional knowledge inside of companies than literature on it. Emily: What happened in the research community is well, that kind of parsing problem is ""solved"" because people had made a certain progress on it for English, and that was mistaken as the problem in general being solved. So, what's new here? Well, this is for Japanese. That's new. But it's actually hard to get people to see that. My goal with what got called the Bender Rule is to say, ""Okay, let's keep English in its place,"" and say, ""When I've done this for English, I need to say that it's for English to hold room for the other work on other languages,"" which is also really important and novel and valuable. We'll see. If we periodically go through...different folks in the field go through and count how many papers in an ACL conference actually work on different languages and actually say what language they work on, and it's not changing as fast as I'd like. But there's some really good developments. The Universal Dependencies project has produced treebanks for many, many languages, and that has spurred a whole bunch of very crosslinguistic work, which is exciting. Lukas: What do you think about... I mean, some of the most evocative work feels like building language models across all the languages or translation models that can kind of use pairs of languages in interesting ways, where you have more data to help with ones with less data. Do you think that's a fruitful direction? Or does that...do you think that sort of encodes our biases somehow in the way it works? 
Emily: I mean, it's certainly interesting, and to the extent that we're relying on these massive data-hungry things, where languages just don't have that much data, seeing what we can do based on transfer from the bigger languages is an interesting and valuable way to go. I think the interesting questions to ask would be, ""To what extent does this impose the conceptualization of the world encoded in English on to the results in other languages?"", and ""What follows from that? What are the risks?"" How does that compare to, ""Well, but if we just do monolingual, we can only get this far, so, we'll take those risks. We'll figure out how to mitigate them."" That kind of work I think is important. It's also really, really important to know that you are working with genuine data in the low resource languages. There was this thing where it came out that — I think it was Scots — the entire Scots Wikipedia was written by one person who doesn't speak Scots. Wikipedia is this really important data source in NLP, so any NLP system that claims to be doing something for Scots just isn't. A fantastic model in that regard is a research collective called Masakhane, which is a continent-spanning research initiative in Africa towards doing participatory research to create language resources for African languages. They've done really interesting work on how to build up the community so that people can come contribute as translators, not machine translation specialists, but people actually translating language. There's a really cool paper that came out in I think findings of EMNLP last year describing Masakhane project. That kind of work of, if you're going to work with low resource languages, being sure to connect with the community. Who would be the people using the technology, then you could find out, ""Okay, well, what are the concerns? To what extent do you want to bring in what we can do from using the larger resource languages?"" versus ""Would you rather stay monolingual and see where we can go and hear from the community and involve the community in the research? I think Masakhane is a great model of that. Lukas: Cool. Well, that seems like a good place to end. We're way over time and you've been really generous. Thank you so much. I really enjoyed talking to you. Emily: Yeah. Likewise, thank you. I can go on and on. So, I appreciate the chance to do so. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. Check it out.",12852 +Jeff Hammerbacher — From data science to biomedicine,https://www.youtube.com/watch?v=NyH6tt86EVU,3413,2021-08-26,"Jeff: I was using data science as a backdoor to problems. It was like, I could talk to people and figure out what they’re working on and figure out how the software and algorithms and analytical methods they were using mapped to problems that I’ve worked on previously and use those analogies to move sideways from work that I had done outside of the biomedical domain into the biomedical domain. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host, Lukas Biewald. I've known Jeff Hammerbacher for a long time, and he's had a truly incredible career. 
He started off running what was, essentially, the data science team at Facebook, and then founded Cloudera, which was a really early company in the data science space and recently went private after being public for quite a long time. But, mid-Cloudera, he actually left and became a professor at Mount Sinai and started his own lab. Now he's working on a company called Related Sciences that does drug discovery with machine learning. I actually ran out of time talking to them today because I have so many questions and the stories are so good. This is a super fun one. Jeff, thanks so much for doing this. Jeff: Yeah, man. Good to see you. Lukas: Yeah. Good to see you. I want to get into the stuff that you're working on at the Hammer Lab, but this is, obviously, for a lot of people who've come up through data science, that we record this for. I thought it might be interesting to start just with your early career, just because I think people would want to know about it, and you had such an outsized impact on the field of data science. I was curious just to hear your story about how you came into Facebook and how you...I think you started a data science team there, right? Jeff: Yeah. So let's see, I landed at Facebook in early 2006. My initial title was Research Scientist and then, eventually, I ran a group of what we would soon call ""Data Scientists"". The next step after that was absorbing what we called data infrastructure at the time, which would — I suppose — now be called data engineering. So we ended up with a team called the data team. It was almost 30 people by the time I left, so it was pretty good-sized, and our mandate was effectively to collect all the data generated by the site and then do analyses on it to improve the business outcomes. It was a rapid learning experience. I was there for less than three years and we went from effectively zero data for offline analytics to petabytes per day. There was no real technology to support doing that at the time, so I was really spending a lot of time talking to people at Yahoo and eBay and Google, just trying to figure out what was going on. The commercial vendors...it wasn't really a blip on the radar yet to do data at that scale, so it was pretty intense. I learned a lot and I met a lot of great people and it eventually led to starting Cloudera. Lukas: People might not realize back then, it wasn't standard practice even to keep all your data. I remember talking to the CTO of eBay, even though I think a little bit after that, and he was saying, ""You know what? We only keep 1% of our click logs because it's just too expensive to store it all."" Why do you think Facebook was so out on the forefront of doing this kind of data analysis? Jeff: We were certainly not ahead of Google, so I would never claim us- Lukas: That's a high bar. Jeff: Yeah. I wouldn't claim that we were at the forefront. I would say it was ""necessity is the mother of invention"", in that we just had so much data and so much user activity that we wanted to understand and our product was evolving so quickly. I think the need for offline analytics was really driven home to executives during the News Feed launch. This is something that's probably incomprehensible to most people listening, but Facebook didn't have a newsfeed when I joined and we went and launched that six to eight months after I joined. I remember getting a phone call...so Mark and Chris Hughes, two of the founders, were doing Facebook's first ever press tour. 
News Feed was a big deal and we had a PR and marketing function at the company, finally, and so they had lined up all of these interviews on the East Coast, and the launch was a disaster. They were out here fielding questions and freaking out because the narrative around the product launch was very negative, but our metrics were pretty solid. So we were spending a lot of time really digging in to understand what was happening to user activity to try and distinguish the narrative, to just see what the users were telling us from what the press was telling us, and then helping to decide whether we needed to roll the thing back. It was a pretty big crisis at the company, and so using data to help stabilize product decision-making...then, I think after that, it became a more critical function at the company, but it took a long time. I think growth was another big motivator. It's another part of the Facebook story that's not really well understood, is that we kind of went sideways for six months there in late 2007, early 2008. There was a lot of stress in the executive team and the engineering team, and a large chunk of people got re-orged to really focus on growth. That ended up creating probably the highest-level awareness that we needed to invest in data infrastructure and data science. I think those were probably the two things that I look back on and think...and also just internationalization. It's tightly coupled to growth, but at some point, you're navigating the product through your intuitive understanding of what people in your demographic cohort want to see from the product, but then you have to transition to understanding what a grandma in Turkey wants from Facebook. At that point, you really need to start flying with instruments, so I think those are some milestones that I can recall from over a decade ago. Lukas: It's funny even to think back to then, but NoSQL was not really a thing a lot of people knew about. What was your tech stack in 2008? Do you remember where you're storing all this data and how you were querying it? Jeff: Oh, totally. Yeah. When I landed in 2006, the tech stack...well, first of all, in 2005 they didn't use version control. This is one of my favorite things about Facebook, was they had a Cron Job that ran every night and tar'd up the source code and copied it off to storage. That was how they did version control. So, it was a different time. GitHub didn't exist, Subversion was the dominant source control product. The tech stack was the LAMP stack, it was Linux, Apache, MySQL, PHP. Facebook played a big role in adding another ""M"" to that stack, Memcached, which was essentially Reddis-ish. The current modern thing I would...it was a key-value cache, so you didn't have to hit the database. It was basically like if we hit the database, we had failed on the application side because it was just the user activity was so high so it had all come out of the cache. So in terms of the stack for analytics, when I got there Dustin Moskovitz, one of the founders, had built something called the Watch Page. The Watch Page was powered by a Cron Job that woke up every minute, issued a query to every MySQL production database to just gather some stats about user activity, and then pull the results of those queries down into another offline MySQL database, which contained a rolling time series of per-minute metrics. That was great and we used it for a long time. 
That's what everybody internally was watching to see user signups, and that's where a lot of the metrics around daily active, monthly active would get defined and pushed out. But we had no offline data store to do analytics work on. These were summaries that were computed at the time of the query and pulled back, so you couldn't do any post hoc analytics over it. So, the initial attempt at a tech stack for a data warehouse was to use Oracle. So, actually, that was me. I didn't make the purchasing decision, but I had to do a lot of the installation and maintenance of that thing. I very clearly remember the Sun T2000 server that we were running on and...obviously, this is all colo'd and not in the cloud at the time and, you know, fiber channel interconnect to a network-attached storage device and running Oracle RAC (Real Application Clusters) at the time- Lukas: Was it sharded or is this like one machine is holding a whole... Jeff: Oracle RAC was a shared storage, distributed compute...so a bit like the architectures that we end up with today in the cloud, where we have this bottleneck to get to your object store — that's how databases were — and these were block stores. I said network-attached storage, but it was actually a Storage Area Network, which was speaking a block protocol to the server, not a file-oriented protocol. That's how databases were built at the time. It was insane to conceive of writing a database that wrote to a file system; they had to talk to the block layer. The file system was just going to slow you down. So yes, we ran Oracle RAC which was, like I said, shared storage, distributed compute, and it fell over immediately. I remember we hired a DBA, a database administrator, and he quit on his third day. He was just like, ""I've never seen anything...this is crazy. What are you doing?"" There's this guy, Tom Kyte, who wrote a lot of books about Oracle database internals. I was reading a lot of Tom Kyte books, learning a lot about tuning. You know, those early multi-core chips — these Sun Niagara chips — were some of the first multi-core...Now we're all stuck with it because Moore's Law has basically ended, but it was really the beginning of the end there. So, learning a lot about how to scale up on multi-core settings and then just starting to look around frantically for something that could scale past that. We had two sources of data at the time. We had production databases, but then we had the major source of data that just ended up totally flattening us. We called it Falcon and it was built by a guy named James Wang to power the News Feed, and it was just an event log. It was the kind of thing that you would pass through Kafka today, but it was just this homegrown C++ toolkit. It was eventually replaced by Scribe, which we made open source and which became a popular tool for log tailing. That was written by a guy named Bobby Johnson. The Falcon logs were the vast majority of data, all the event data. Any time a user did anything on the site, we'd log it, and then we wanted to use that to reconstitute information about user activities. Falcon is what really ended up just knocking us down, and so I was frantically looking for a new tech stack beyond just an Oracle RAC instance. There were a few alternatives. At the time there were a lot of shared-nothing distributed database companies targeting the data warehousing market. Netezza had been very successful using custom silicon ASICs to accelerate queries in a shared-nothing architecture. 
They had gotten bought by IBM for $400+ million and that really caused a lot of new entrants to come into the market. So these were companies like Greenplum, Aster Data, Vertica, ParAccel. A lot of interesting distributed database companies, but most of them couldn't scale to what we needed. Honestly, the Yahoo experience was what I modeled a lot of our tech stack after. You'll be familiar with that from your time there. So they had a similar SQL-querying-over-event-log-data infrastructure called MyNa, My NetApp, which, unfortunately, I didn't spend a lot of time talking publicly about, but I managed to get to know the people that built it and learn about how it worked. It was effectively a Hadoop-like architecture, but instead of a data node in a distributed file system, they had NetApp filers where they were querying data over. So we hired a guy named Suresh Antony who built, effectively, a very rapidly implemented version of MyNa called Cheetah to bridge us between the Oracle era and whatever came next. Then we started really looking around and we found the Hadoop group at Yahoo. Eric Baldeschwieler and folks, Owen O'Malley and Doug Cutting, obviously, were doing some really interesting work to pick up this work that had been published by Google about MapReduce and the Google file system and implement it as an open source project. Everybody thought it was insane at Facebook. Writing stuff on the JVM was just very much frowned upon. It was a very polyglot programming language environment, but Java was the one exclusion from that zoo. It was an uphill battle to convince people that this was going to be something that might solve our problems, but eventually it became a pretty significant component of our infrastructure and we ended up writing a lot of database utilities on top of it. A project like Hive — which is a SQL query interface and a metadata manager in front of the distributed file system and the MapReduce implementation — ended up becoming a really significant component of our analytics tech stack there. Lukas: It sounds like you had some of this infrastructure built when the growth stopped. I think a lot of people, myself included, relate to the pain of growth stopping and trying to figure out how to get it going again. Was there some piece of analysis that you felt like you did to get that restarted, or was it just a lot of little things? How did that go down? Jeff: I've actually gotten a cease and desist from Facebook before for saying this in an interview, but the honest answer is the Hotmail contact importer. Lukas: I remember that, yeah. Jeff: That was the era. That was the social graph of 2006 to 2008. It was Hotmail. Yahoo Mail to a lesser extent, like a 10th, and Gmail, even smaller than Yahoo Mail. It was really about — what do they call them, dark design tactics or something? — it was these things where it was like, ""Put in your email address and we'll invite all your friends, and we'll just auto-select all of the emails and obfuscate that, and if you click okay we're just going to spam your inbox and spam your mailing lists,"" and that was really how Facebook grew. There was a lot of stuff after that that was a lot more targeted. In our group, we had a guy named Itamar Rosenn who was my first hire and is still there. Lukas: Oh, no way. He's a classmate of mine. Jeff: Yeah. He's still there. I was just texting with him yesterday, I got to catch up and see how that's going. 
So, Itamar...there's a guy, Matt Cohler, who was an executive who was really one of the key strategists for early Facebook. Cohler — I'm sure at the behest of Mark and some of the board, or potentially it was his own idea — I'm not sure exactly who, but it was communicated to me through Matt Cohler. He pulled me and Naomi Gleit, who you may have also been a classmate of, if I know my Stanford connections. He pulled me and Naomi and he said, ""Hey, growth is an issue. Let's start dedicating some analyses to it."" We started meeting regularly and doing analyses, and Itamar joined not long after. Itamar generated this weekly growth report, which was a set of standard metrics as well as a deep dive every week that was distinct and specific to some high-level question we had at the time. That growth report, we turned it into a PDF to make it look nice and sent it out to the company. I used a lot of LaTeX back in the day for my math notes in college, so I like to- Lukas: You teed that up in LaTeX and then used that as a company report? That's amazing. Jeff: You do it for a year and all of a sudden you're fluent, and so then it's hard to go back because it just looks so much better when it's in a nice...there's all kinds of better ways to do it today, but that was my solution then. So yeah, so Itamar would send out the growth report with a lot of input from Naomi and Cohler, and that became a focal point for analyses to better understand growth. Then, ultimately, a growth team was built. If I recall it correctly, James Wang, the guy that wrote Falcon, ended up being the engineering manager for that growth team. He played a big role in the initial work that they did over there. Lukas: Wow. That was really fun, too. Thanks for taking me through that. Then, I guess I have the same question on Cloudera, which is also an iconic company in data science. I remember when you were starting it and thinking about what the market size would be, but I guess what really prompted you to start it? Can you tell me a little bit about the early days of getting that off the ground? Jeff: Sure. Well, we tried to start it earlier in 2008. So this guy, Christophe Bisciglia, was at Google and was teaching a MapReduce class at the University of Washington and was really trying to push Google to proselytize their approach to data management and data analysis into the academic environment. He was using Hadoop in that course, so he was connected to the Hadoop community through that. Microsoft made a bid to buy Yahoo in early 2008, and that catalyzed...so Christophe and I had been chatting about what it would look like to start a company to support Hadoop, because he needed it for his work and I needed it for my work. When Microsoft said they were going to buy Yahoo, then we were like, ""Oh, boy. We really need to accelerate the timing on this."" So that was early 2008 and we had a third guy who was going to be a co-founder, a guy that I had gotten to know because we interviewed him to be VP of Engineering at Facebook and we actually made him an offer and he turned us down. Mike Abbott was his name. So Mike is now at Apple running a big swath of their software development, and I really hit it off with him during the interview process. I stayed in touch with him and I was like, ""Hey, man."" He had a lot of experience with database internals; he had a startup company called Composite Software that did federated query, which I guess today would be called a data mesh. 
Mike was always a smart guy and I really wanted him to start the company, but he actually had some personal life issues that made it not really work out. It kind of fell apart in March '08, but that got Christophe and me talking, and he started working on his own. He recruited a guy named Mike Olson, who I had followed for a while because Mike was the CEO of Sleepycat Software, the maker of Berkeley DB, which is an embedded database that was very, very successful. The killer app was Active Directory. Mike had sold his company to Oracle, had done two years, and was on the way out. Christophe had recruited him to...he actually incorporated the company as ""Clouderra"" with two R's and Mike was the CEO, but another guy, a third guy, Amr Awadallah — who you probably know from your time at Yahoo; he had run a group called Product Intelligence Engineering, which was very successful — Amr had spun out of Yahoo and was convinced by a guy named Andrew Braccia at Accel to be an entrepreneur-in-residence at Accel Partners. Amr was, actually, at the time working on a spot market for cloud resources, which was a very early idea to have in 2008. So we were like, ""Maybe this isn't the right time for that. Maybe someday it'll work."" I had spun out of Facebook to do an entrepreneur-in-residence program at Accel Partners as well. I had actually cooked something up with a guy named Eric Vishria, who's now a partner at Benchmark, and we were working on a consumer energy demand monitoring system. Eventually, Amr and I got to chatting and Christophe really catalyzed the whole thing. He and Mike were moving forward and Amr and I were like, ""We should probably hop on there."" So Amr, me, Christophe, and Mike ended up reconstituting it as ""Cloudera"" with one R and then just re-founded the company going forward. We ended up hiring Doug Cutting about a year later once we had established some credibility, but it was just the four of us when we got moving. Lukas: What did you work on in the early days? It must've been a pretty big change going from running the data science team to founding a company. Jeff: Oh yeah, for sure. On the one hand, yes, on the other hand, no, because at Facebook it was a very sink-or-swim culture. I really felt like I built that data team with no real supervision. I basically went around the block once a week with Adam D'Angelo to just have a conversation for an hour. He was very helpful about just clearing roadblocks for me and helping me think through strategic things. But ultimately, it was just something that I thought needed to be built and they just said, ""Go build it."" I don't think anyone up top at Facebook was like, ""Let's hire 30 people to work on a data team."" I think I just kept hiring people, and at some point they looked over and they were like, ""That's a pretty big data team."" People talk about an ""intrapreneur"" or whatever and I guess it did feel a bit like that. I did feel like I was just building a little company inside of Facebook, and ultimately the Cloudera product roadmap was just the Facebook data infrastructure product roadmap done as a... Most of the reason I started Cloudera — or I got involved with Cloudera — to be honest was, I just wanted to see the things that I wanted to build exist in the world. 
I knew that Facebook, they were entering a period where they weren't going to be quite so excited...it was more of a ""buy versus build"" period — which made complete sense given the scale of the business and the success of the business — so I was like, ""I'd rather build some of this stuff."" So we got to work. Hiring was, obviously, a lot harder to hire for a random startup versus the hottest startup in Silicon Valley. I had to do a lot of legwork on hiring, and then just figuring out what to build. Sequencing, I knew what the end state was going to look like, but I didn't know how we were going to get there. Figuring out what to build first was pretty hard. We started with a couple of open source projects to just get data into the Cloudera environment. A project called Sqoop and a project called Flume that were dedicated to database and log data, in particular. Honestly, I saw Splunk at the time and I was like, ""I want to get to a pricing structure that looks like that."" I think the reason why data companies work in 2021 is the consumption-based pricing and Splunk had that figured out in 2005. But we never really could figure it out at Cloudera, We ended up getting stuck with a more Oracle-, Teradata-like pricing model. So yeah, so we were working on it, effectively filling out the stack to become a vertically integrated data platform — whatever they're calling them these days — but a place where you would collect data, put it, structure it, query it, analyze it, fit models to it, what Snowflake and Databricks are trying to build today. It was a very obvious product roadmap. That's what we wanted to build, we just couldn't figure out how to build it or how to get there, what the right sequencing was to get there. The other thing that we had to do is swap out components over time. We all knew that there was a shelf life to the core Hadoop projects and so we were trying to think beyond it. How do you make that transition from these legacy products to what we felt could actually serve as production enterprise workloads competitively with what other vendors were offering? Things like Impala for query engine or Kudu for table storage were always something we wanted to build, but just had to figure out when and how to get it out. Lukas: I think one thing that was interesting at the time — it seems so wrong in retrospect that it's hard to believe people thought this — but I remember actually talking to Matt Cohler about Cloudera and he was thinking, ""How many companies would really use this? Maybe it's tens or maybe a hundred,"" or something like that at the time. I think even you expressed a little bit of doubt to me when you were starting. Did you feel worried about the market size or how did you think about that? Were you just sure that it would work or was that ... Jeff: Nah. For me, it was about manifesting a product vision, not about building a huge company. Lukas: Interesting. Jeff: I didn't expect it to get as big as it did, or people to care as much as they did. When I was leaving Facebook, I wanted to work on a super nerdy infrastructure software company. What could be nerdier than Hadoop? Within a year, we were in the New York Times and that part of the hype around it was always a huge turn off to me. It wasn't something that I wanted. I wanted to hire the best engineers from Sun and VMware and Oracle and Google and get them to build open source infrastructure that would allow any company to do what Google could do. 
That was what I wanted to do and whether or not it had commercial value at the scale that would necessitate venture returns, it wasn't that critical to me because we didn't raise that much money. Our Series A was $5 million. Our Series B was 8 or $9 million. These aren't even seed rounds anymore. So what we were building was different from what it became. I agreed with Cohler at the time, I didn't worry about who was going to use this because I just worried about completing the product. I just knew everybody was going to need it, to be honest. Everybody was going to have a petabyte-scale data. I didn't know in what form they were going to be storing and analyzing it, but I wanted to solve problems to facilitate that world. But yeah, our Series B was a brutal fundraise. Our Series A was easy because Amr and I were both entrepreneurs-in-residence, and so we had two partners who loved us and believed in us and they would have given us money to start whatever we wanted. But then our Series B...we ran around Sand Hill and I actually remember I got a nice note from Dana Stalder at Matrix Partners a few years after, because he just beat us up in the pitch where he was just like, ""I don't ever expect you'll get a seven-figure deal for this."" He was like, ""You'll probably get less than 10 six-figure deals for this. There just isn't a market. You should just pack it up now. This was like a science project."" You're in those meetings and you just hear that over and over and over again and it's like, ""Yeah, that's a valid position to take."" I didn't necessarily disagree with it. So yeah, I couldn't be happier. The fact that they're still focused on open source is quite cool. Lukas: Do you feel any frustration that they're not a more iconic company? They were so early with the strategy that's worked so well and it's hard to say, I don't know, whatever their $5 billion market cap is not a wild success, but it does seem like they missed people shifting to Spark. Does that bother you at all? Jeff: I'm kind of weird in that I don't like big companies. To me, it's not a success if you have a hundred billion dollar market cap, but you've got all closed source software and you have it...so to me, I always talked about Cloudera as an engine for turning VC first and then company enterprise dollars into open source software. So for me, I look at the public goods that were created. I look at the standards, the software, those kinds of things. Honestly, I made plenty of money, I'm going to be okay. People who want another zero, at this point, it's all going to some foundation. You know what I mean? There's no material needs that's going to be resolved by if there was another zero on the end of Cloudera's evaluations. I honestly don't know why people want more money than what we were able to make, and that was honestly a pretty big surprise anyway. I didn't start Cloudera to make money. For me, I look at things like Arrow and Parquet and Ibis and other kinds of open source infrastructure...even Hue, our user interface, has become adopted by pretty much all the cloud providers. I look more at, ""How do you change the tools that people use in their-."" and ""How do you change their thinking?"" Impala was really the first open source, vectorized, codegen, distributed query engine. It was something that everybody knew we needed to build and I was really proud of it when we built it. Whose name is on the jersey, at the end of the day, I don't really care. 
It was more about impacting the universe of ideas and public goods. I'm really happy with a lot of the work that we did. I will say, I think just being on the JVM is just tough for day-to-day developers. You can impact enterprise, but ultimately, no one uses enterprise stuff in their day-to-day. Snowflake is a huge company and they've built great technology, but it doesn't change how I do data analysis on a day-to-day because I don't need a super expensive data warehouse for my day-to-day data analysis. We built a lot of stuff off the JVM at Cloudera, subsequent to founding. It was a funny era to get stuck in JVM. I wish we had pushed more Python. We ended up buying DataPad, Wes McKinney's company, and we had Wes McKinney in the company for a while. It was after I had checked out — I was Founder Emeritus at the time, I referred to myself — I could never really convince our head of product management to really push on the Python ecosystem harder, but you can see that's where everybody's going now. I think if there's anything that I regret, it's not being able to influence people to get more into the PyData ecosystem sooner. Lukas: Interesting. I also wanted to ask you about this incredible career transition that you've made that I'm just so impressed by it, to go into research. Can you talk about how you did that, how you got up to speed enough to start your lab, how you learned about almost a totally different field? Jeff: Yeah, totally. So, 2012-ish at Cloudera, we were four years in and it was bigger than I ever expected it to be. I'd replaced myself twice, first as VP Product and then VP of Data Science. I had hired people who were better than me at that job. The only thing left to do was hire a professional CEO, and we kicked off that search. To be honest, I was also having a lot of misgivings and also health issues that just made being a high intensity startup founder executive job in San Francisco just very unpalatable to me. When I was thinking about what I wanted to do next, I really wanted to focus on finding a domain where I could do data science and not get bored of the entities under analysis. I had started my career on Wall Street and very quickly realized I didn't really want to think about money all day. Then, I moved to Facebook and pretty quickly realized I didn't want to think about how people navigate consumer web products all day. But I loved the software methods at both jobs. It was a weird thing. I really enjoyed my jobs, I just could not care less about what the product was at each of those jobs. Cloudera was always, to me, a way point where I was like, ""Hey, I want to be able to do data analysis at scale. Tools don't exist to do that with open source software. This is our best hope of just getting some tools for doing data analysis at scale into the world, so I'm going to do that."" But I do data analysis, I don't necessarily see myself making tools for data analysis for the rest of my life. In 2012, I started thinking about different domains where I might not get bored, and biomedicine was just a big, expansive domain where I thought there's a lot of sophisticated work happening, but the technical infrastructure was actually pretty limited. We had sold into pharma companies at Cloudera and they were some of the last to adopt modern technology stacks. We had partnered with some large academic institutions and I saw their infrastructure and it was very outmoded and slow-moving. 
So I thought, ""Oh, hey, there's some things that we learned over here that could be useful over there and I probably won't get bored of what's going on."" In 2008, when I left Facebook, I'd looked into the biomedical domain to do a startup and I had met a bunch of interesting companies at the time. This is like when 23andMe was getting started and — oh, there was another company that was just like 23andMe that I went and visited as well, I can't remember their name — so I got to know a group of people in the biomedical field who had started a nonprofit called Sage Bionetworks that was creating a shared infrastructure for storing and analyzing data in a pre-competitive, open source fashion. They asked me to come and advise them on data infrastructure and open source strategies as they were creating this nonprofit and, eventually, asked me to join the board. So I served on the board of that nonprofit and through that lens, I got to see and meet a lot of interesting people and it helped confirm for me that this was a field that I would enjoy working in. Ultimately, what catalyzed me moving into an actual role in that field was Eric Schadt, one of my fellow board members at Sage Bionetworks and one of the creators of Sage Bionetworks. He was recruited to run the Department of Genetics at Mount Sinai in New York City. I like New York a lot more than San Francisco. I moved to San Francisco from New York and I was very dismayed. I was like, ""I thought this was supposed to be a city. Everything closes at midnight or 2:00."" I don't remember. It certainly wasn't 4:00 AM like in New York City. It was so tiny and the public transportation was terrible and so I was always very underwhelmed with San Francisco as a place to live. It was so cold all the time, so I was very excited about New York City as a place to live, relative to San Francisco. I was excited about doing something in the biomedical domain with software and data. We were getting beers at the Nut House in Palo Alto, which I'm sure you know, and he was having me talk to people over there to just talk them through what they could build. He was like, ""What would you think about just being out here with an actual position at Mount Sinai?"" I thought about it and, ultimately, I was like, ""Okay, that sounds like fun."" We worked something out with Cloudera where I was like notionally part-time, so I was going back forth between San Francisco and New York for a while. In the fall of 2013, it was really when I was like full-time in New York and started hiring people in the lab. So I had a year to just read a lot of textbooks, talk to a lot of people who are working around, play around with the software. I've always been autodidactic. I got terrible grades in school. It was always about reading and thinking more than it was about doing homework for me- Lukas: You got terrible grades, but you went to Harvard, right? How does that- Jeff: Yeah. So it's a little bit complicated. I had a good SAT score. I started getting terrible grades junior year of high school. I had, I guess, enough good grades to buoy my grades and my overall GPA. And I played baseball, so I ended up getting into Harvard primarily as an athlete and an SAT score and then a decent GPA. It was basically, like, once I hit 16 that I stopped caring about school. I think early Jeff was engaged enough to achieve a GPA that was not going to be fully dismissed by Harvard during the admissions process, thankfully. Lukas: Mm-hmm (affirmative). 
But yeah, I guess you've done an incredible job of quickly learning really hard topics, so that makes sense. So you got up to speed...I actually try to research all the people that I talk to and I was looking through your list of papers and I could barely parse the titles to them, honestly. Jeff: Yeah. You talk to people who are doing work, you read papers. Review papers are key for me. Finding a good review paper on a topic, and then figuring out who wrote it and then what their recent research is, and just finding kindred spirits, people who think like you do and being able to converse with them and interactively map a domain. I had had biology education previously. Thankfully, Harvard is a liberal arts education, so I had done courses on DNA and neuroscience and molecular biology, so I had the basics. So yeah, just reading a lot of papers and...software is a good angle. I used to reference a lot, John Tukey, who was kind of a proto-data scientist, and he has a quote where he said, ""I love being a statistician because I get to play in everyone's backyard."" I was using data science as a backdoor to problems. It was like, I could talk to people and figure out what they're working on and figure out how the software and algorithms and analytical methods they were using mapped to problems that I've worked on previously and use those analogies to move sideways from work that I had done outside of the biomedical domain into the biomedical domain. There's a lot of problems that you can find analogies for and choose methods for. In particular, we were able to find a really cool problem in a domain of cancer immunotherapy. When I was moving into biomedicine in 2012...2011 was a milestone year for the approval of a immune checkpoint blockade drug¹. This was a drug, which rather than targeting anything related to cancer, was actually targeting the immune system. What it was actually targeting was...a T-cell is a cell in your immune system that's responsible for cellular immunity, for killing bad cells. Cancer cells are bad cells. T-cells were believed to be the mediator of the immune response to cancer. There was this protein...when a T-cell gets angry and starts wanting to kill stuff, it expresses an off switch because it's very important that you'd be able to turn T-cells off. T cells are very destructive and your body needs to be able to resolve the immune response, and so the T-cell exposes this off switch. The notion behind immune checkpoint blockade is ""Cancer might've figured out how to press that off switch. What if we basically covered up the off switch and we made it so that T-cells couldn't be turned off by cancer?"" Perhaps that would cause the immune response to cancer to fully eradicate the tumor. It works for a shockingly high percentage of people. The most exciting thing about immune checkpoint blockade — at the time — was these Kaplan-Meier curves, these survival curves, where you could see that immune checkpoint blockade was raising the floor for long-term survival of patients. It wasn't just advancing survival by a few months or years and then, ultimately, everyone had the same 10-year outcomes. It was genuinely changing 5- to 10-year outcomes. Obviously, it took a long time to see that, but those results are holding and that durable response to cancer was wildly unusual, and then- 1: ?? Lukas: Is that something you worked on? Jeff: Ultimately, yes. When I came to Mount Sinai I had never heard of it, but there was a principal investigator at Mount Sinai named Nina Bhardwaj. 
Nina was a very successful immunologist who was pursuing a few ideas for ways of stimulating an immune response to cancer. One of the things that she was very early on was an approach called a neoantigen vaccine. This is a therapeutic vaccine. Most people think of prophylactic or protective vaccines, something you get so that you don't get a disease. Therapeutic vaccines are given to stimulate a specific immune response while you currently have the disease, with the goal of curing it. A neoantigen vaccine is a therapeutic vaccine. An antigen is a specific target of the immune response, and a neoantigen is an antigen created inside of a tumor cell due to the mutations that the tumor accrues. Cancer is a disease of the genome. The way that a cell becomes cancerous is that it accumulates mutations that equip it with behaviors that allow it to grow out of control. Often there's a positive feedback cycle, so getting additional mutations might damage your DNA repair machinery, for example, which then causes you to accumulate even more mutations. A lot of cancers have accumulated many mutations, and the more mutations you've accumulated, the more likely it is that one of those mutations has changed a protein produced by that cell in a way that causes that protein to become immunogenic, that is, to create an immune response directed against it. Neoantigens are those sub-sequences of amino acids inside of proteins that have been altered by mutations accumulated by the tumor cells, which create these novel or neoantigenic targets for cancer. The idea was, ""What if we could sequence someone's tumor, sequence their normal tissue, look for mutations that are in the tumor but not in the normal tissue, and figure out which one of those mutations might generate an immune response for this particular patient. For this particular patient, can we then synthesize a vaccine which will stimulate an immune response specifically against those neoantigens in their tumor, suited for their immune system?"" Everyone's tumor is unique, but also, everyone's immune system is unique. If you ever have to do a tissue or organ transplant, you get HLA typing done. Your HLA type is what effectively determines which amino acid sub-sequences of a protein your immune system cares about. So you had two inputs. You had the HLA type of a patient, and then you had the somatic mutations — that is, the mutations present in the tumor and not in the germline tissue — and those became inputs into a predictive algorithm that would predict, ""These are the neoantigens most likely to generate a response."" That was the data science problem that we identified embedded within this larger research. At the time, Nina's group was just leveraging a web server built by another group and they generated predictions for her, and so we looked at it and said, ""Oh, hey. Maybe we can build you a better predictor of neoantigens."" Ultimately, she was very trusting and allowed us to participate in the phase one clinical trial. Our group wrote the computational component of the clinical trial protocol, and ultimately administered the computational algorithms that generated the vaccines that went into actual humans. That was a pretty fun research project to be involved in. Ultimately, the software we wrote called MHCFlurry² is...so finally, we get to something that might matter to your listeners now that we're what, 48 minutes into the conversation. If you made it this far, machine learning happens here. 
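To make the tumor-versus-normal comparison concrete, here is a highly simplified Python sketch. Real pipelines use dedicated somatic variant callers and annotation tools, so the set difference and the fixed flank size below are illustrative assumptions only, not the actual clinical workflow.

# Somatic mutations: variants seen in the tumor but not in matched normal tissue.
# Illustrative only; real pipelines use purpose-built variant callers.
def somatic_mutations(tumor_variants, normal_variants):
    return sorted(set(tumor_variants) - set(normal_variants))

# Candidate neoantigen peptides: a window of amino acids around the mutated position.
def peptide_window(protein_seq, mutated_pos, flank=10):
    start = max(0, mutated_pos - flank)
    end = min(len(protein_seq), mutated_pos + flank + 1)
    return protein_seq[start:end]

tumor = [('chr1', 12345, 'A', 'T'), ('chr2', 67890, 'G', 'C')]
normal = [('chr2', 67890, 'G', 'C')]
print(somatic_mutations(tumor, normal))                         # only the tumor-specific variant remains
print(peptide_window('MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ', 16))  # 21-residue window around position 16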
So we ended up building a neural network that predicted neoantigens, called MHCFlurry, that's now one of the better approaches, and is still actively developed by several people. 2: https://pubmed.ncbi.nlm.nih.gov/29960884/ Lukas: Two questions to make sure I understand what's going on here. So one, does this mean that every single person gets a slightly different medicine, based on...can you even do a clinical trial where everyone's...I always just imagine a clinical trial, everyone gets the same thing. I guess in this case, everyone gets the same process. Is that right? Jeff: Yeah, no. You've hit on a very fascinating question that has generated conversation at the FDA and that continues to this day, which is, when the therapeutic is an algorithm and not a molecule, how do you administer a clinical trial that can generate evidence that the algorithm itself can create better outcomes? Fortunately for us, they were pretty understanding and allowed the trial to go forward. I don't know how it's going to work. We were building what's called a peptide vaccine. The actual molecules that we put into patients were little sub-sequences of amino acids called peptides, together with adjuvants, just general-purpose immune stimulants to draw the attention of your immune system to those peptides. Peptide vaccines are very well understood as a therapeutic modality and widely considered to be safe. So I think that certainly helped, but the intervention under study in that clinical trial is an algorithm, not a specific molecule. It's different for every patient. Lukas: That's so cool. I guess the other thing...I hope I'm following all the steps here, but it felt pretty deterministic to me, like what's going on and then what intervention you would want. What's the part where you need a machine learning algorithm? I guess the way you were explaining it, I was thinking, ""Oh, you look at the genome and see where the problem is and then you know the amino acid, and then you know the medicine that you need."" Where's the uncertainty that requires you to use an ML algorithm versus, I guess, just some deterministic logic? Jeff: Sure. So the HLA type of a patient is a set of genome sequences for genes that code for proteins, which are highly polymorphic. That is, they're different across the population. There's at least six of these proteins that matter, and every person has a distinct repertoire of those six proteins. One input to the predictive model is the amino acid sequence of all six of those proteins, and that's pretty variable across the population. Then the other input of the model is a window of amino acids around every point mutation that occurs in your tumor that doesn't exist in your normal tissue. Cancers can accumulate hundreds, thousands, hundreds of thousands, sometimes even millions of somatic mutations, with melanomas having the largest mutational burden. You end up with two sets of sequences as inputs to the neural network- Lukas: I'm sorry. What's the output of the neural network? Jeff: The output of the neural network is a predicted binding affinity. I don't want to explain exactly what HLA molecules do, but effectively, your body chops up all the proteins in your cells, a subset of them, for processing, and it chops them up into smaller fragments. Your HLA proteins bind selectively to a subset of those smaller fragments, which your body believes to be interesting to present for inspection to your immune system. 
What you're ultimately trying to predict is the binding affinity between peptide fragments generated from the proteins in your tumor cells and the HLA proteins that are specific to your immune system. So, ultimately, the thing you're predicting is this protein-peptide binding affinity. Lukas: How did you get labeled data for this task? Jeff: There's a group in San Diego that generates the vast majority of the labeled data, and they've done a great job of curating it. There's something called the Immune Epitope Database³. It's a fairly difficult...we actually got to the point where I had a wet lab and I talked to the group in San Diego about generating measurements of our own to create labeled data and they were like, ""It's not worth it. It's really hard. Just use our stuff."" Later in the lab's life, some new techniques for generating labeled data from in vivo tissue came out that used a different measurement paradigm. Some of the work that we did in the lab as I was leaving — and it was carried on by members of my lab in the new labs they worked in — was to leverage this alternative source of labeled data and bring it together with this early source of labeled data. There are a few different assays, all of which are pretty difficult to run, so we don't get super high throughput. The mass spectrometry data, this novel source of labeled data, is often positives only, so you're not necessarily measuring...there's a lot of work that has to go in, and as you're very much aware, you don't just get to pick up a dataset and fit a model to it and call it a win. There's a lot of work that goes into massaging the training data to get it ready for machine learning. 3: IEDB: https://www.iedb.org/ Lukas: Is it important for this task, or these kinds of tasks, to use modern techniques like deep neural networks, or do you think simpler techniques would also work pretty well? Jeff: One of my frustrations is we didn't write more papers about the work that we did, because one of the theories that I had for this lab was to just hire a bunch of people from industry and see if we could turn them into academics. One of the hardest things to do with people from industry is to convince them that writing a paper is worthwhile, but we did. We tried a lot of cutting-edge stuff. One of the guys that worked on the problem early on was a guy named Alex Rubinsteyn, who's now, actually, a professor at the University of North Carolina, Chapel Hill in a biomedical department, which is cool. He did a PhD at NYU in the whole deep learning craze, so he was pretty experienced with the models. We iterated through a lot of more complex...this is the era when LSTMs were becoming very exciting, sequence learning models. So, I think, I remember Lasagne was a library built on top of Theano. I think there was a guy Colin Raffel who was really good with it, and he came down and talked with us. I feel it was Alex Graves at DeepMind that had a lot of sequence-to-sequence learning. We went up to NeurIPS three years in a row as a lab and presented some work up there. We were definitely paying attention to what was happening in the state of the art for learning on sequences. It didn't make a huge difference. I remember trying out Siamese networks and things like this and it wasn't really moving the needle. I honestly don't know where they landed, what the current version of MHCFlurry is, from a neural architecture standpoint. But I want to say that nothing we tried that was more exotic made a huge difference. 
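To give a rough sense of the model shape described here — two sequence inputs, one binding-affinity output — below is a toy Keras sketch. It is not the actual MHCFlurry architecture; the amino-acid encoding, sequence lengths, layer sizes, and the sigmoid-rescaled affinity target are made-up illustrations.

# Toy peptide/HLA binding-affinity predictor. Not the real MHCFlurry model.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

AMINO_ACIDS = 'ACDEFGHIKLMNPQRSTVWY'
AA_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}  # 0 is reserved for padding

def encode(seq, max_len):
    # Map an amino-acid string to a fixed-length vector of integer indices.
    ids = [AA_INDEX.get(aa, 0) for aa in seq[:max_len]]
    return np.array(ids + [0] * (max_len - len(ids)))

PEPTIDE_LEN = 15   # window around the somatic mutation (assumed length)
HLA_LEN = 34       # pseudo-sequence summarizing the HLA protein (assumed length)

peptide_in = keras.Input(shape=(PEPTIDE_LEN,), name='peptide')
hla_in = keras.Input(shape=(HLA_LEN,), name='hla')

# A shared embedding turns amino-acid indices into learned vectors for both inputs.
embed = layers.Embedding(input_dim=len(AMINO_ACIDS) + 1, output_dim=16)
x = layers.Concatenate()([
    layers.Flatten()(embed(peptide_in)),
    layers.Flatten()(embed(hla_in)),
])
x = layers.Dense(128, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)
# Output: predicted binding affinity, e.g. an IC50 value rescaled to [0, 1].
affinity = layers.Dense(1, activation='sigmoid', name='affinity')(x)

model = keras.Model([peptide_in, hla_in], affinity)
model.compile(optimizer='adam', loss='mse')
model.summary()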
So, ultimately, I think mostly no for that problem. I should also say that the leading predictor prior to ours, for a decade, was a neural network. So this is a field where they already were using neural networks before the deep learning craze happened. It's not like we were coming into the field and we were like, ""Hey everyone, neural networks."" They were like, ""Yeah, of course, neural networks, we've been using... "" We weren't trying to act like we were bringing fire from Olympus. It was like everybody was already using neural networks, but could you make better use of them? So embedding layers and things were relatively novel approaches. There were some ideas that we could bring to bear, but it wasn't just a slam dunk to just use the latest neural architecture. Lukas: What types of things are you working on now in your lab? Jeff: Well, nothing actually. I'm on leave from my lab so- Lukas: Well, what are you working on? What are you up to? Jeff: Yeah. I went on leave from my lab in January of 2020 because I started a biotech venture creation firm with two of my friends, Adam Kolom and Jack Milwid, in mid 2019⁴. One of the things that I did with my lab...so I started my lab up in New York City and it was purely computational. But one thing that you learn quickly if you're running an academic lab is that it's difficult to collaborate in academia, and it's a lot easier if you own vertical research ideas rather than being a person who brings a skill into a horizontal research network. Those are just a lot harder to build, those horizontal research groups, and they're often built through pedigree like, ""Oh, I did my PhD with this professor and so I'm going to work with you."" I had zero pedigree, so I recognized pretty quickly that this theory that my lab could be this ally to many other labs was like, no one wanted an ally. So I had to build data generating capacity on my own. I ultimately, ended up building a wet lab as well, and for a variety of reasons realized that academia was a better place for me to be doing basic science rather than translational science. So this neoantigen vaccine idea that we worked on when it was very early stage, ultimately there were several venture-backed companies that went public and had hundreds of employees working on it, including BioNTech, actually, which is the maker of the vaccine that I got for COVID-19, which was nice. It was a lot of...100x more resources could get put into that problem on the commercial side versus the academic side. So I decided to start angling my lab towards more basic science questions and doing mostly data generation with some computational work layered on top. We started working on things like optimizing protocols for genome editing in T-cells and growing organoids, which are small, 3-dimensional model systems to represent tumors in vitro that we could do more reliable experimentation upon. We layered some computer vision work on top of that, which was pretty fun. We did some natural language processing work over the research literature as well, but the lab became more of a traditional biology lab than a computational group. But the other part of that idea was that, ""Okay, my lab should become more basic, but I want to have some translational work."" So the translational work I decided to funnel through this biotech venture creation firm that we created called Related Sciences. So yeah, for the last two years or so, I've been working mostly full-time on Related Sciences. 
The idea of Related Sciences is to use data to identify promising pre-clinical therapeutic opportunities and to create companies to then pursue those preclinical therapeutic opportunities. 4: related sciences Lukas: Wow. Very cool. Awesome. Well, thanks so much for your time. It's such a pleasure to catch up and so cool, all the things that you've done. I love that I got a chance to hear all these stories, so... Jeff: Yeah, no, I wish...I could talk more about the fun machine learning tools and techniques we're trying out at Related Sciences some other time, but I'm always happy to talk about my history as well. Lukas: Yeah, I really appreciate it. We should do a follow up. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out.",9696 +Josh Bloom — The Link Between Astronomy and ML,https://www.youtube.com/watch?v=0aOXOT2TvUc,4096,2021-08-19,"Josh: Astronomy and physics work in a world that's sensor-based, fundamentally, in terms of our observations. Because it's sensor-based, there's noise. So, unlike in the AlphaGo-Atari world where every pixel has a perfect measurement, if you take an image of the sky or you measure some time series, there's noise associated with it. Because there's noise and because there's a finite amount of training data, if you build models off of that, you get uncertainties in the models because of its lack of expressiveness or its overgeneralization or overfitting. Then, you also have a source of uncertainty in what it is that you're trying to understand, just because fundamentally you don't have a perfect measurement, your signal noise is imperfect. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host, Lukas Biewald. Today, I'm talking to Josh Bloom who is the Chair of the UC Berkeley Astronomy Department. Astronomy has been the source of many innovations in data and machine learning. It's also changed a lot due to machine learning. I'm really excited talk to him about astronomy in general, but also how machine learning has affected the field. Josh, thanks so much for doing this. I have so many questions about astronomy in general, as someone interested in it but not very knowledgeable about it. I'm gonna try to control myself from just going down that path. One thing I was thinking about is, it seems like astronomy has informed...or ex-astronomers have done so much interesting work in machine learning. I was kind of wondering, if you have any thoughts on why that is, why there's such a path from astronomy into machine learning? I feel like it must have something to do with the large data sets that you all deal with, but is there something there? I mean even you went to a startup at some point and came back into the field, right? Josh: Yeah. The way I put it this way is that astronomers are quite good at using and co-opting tools that are built elsewhere to get our work done. Maybe the most famous example is this guy named Galileo who heard about this thing called the telescope and instead of pointing at the horizon looking for enemy ships, he pointed up that way, and the rest is history. We have been co-opting tools for centuries for our own benefit. 
And partly that's because I think astronomers are naturally curious people, but also because we're looking for an edge, fundamentally. We are often working right at the limit between where there's an obvious answer, where you have a lot of data and it's high signal-to-noise, and where it's just complete noise. And the real discoveries are happening essentially at the 5-sigma level. We are incentivized in many ways to pull in all these different tools and toolkits from all over the place. Astronomers, obviously, aren't just using these tools; we're using a whole bunch of inference techniques and problem-solving skills in a way that I think becomes very valuable outside of the specific questions that we ask. So, for sure, when I started a company in the machine learning space — and we can talk about the origin story of that if you're interested, and how we came to ML — we started hiring. And while we certainly weren't looking to hire people that had a similar background to us, oftentimes when we got into coding exercises and we got into solving problems, a lot of the people that were making it through that we were excited about had a physics or, more specifically, an astronomy background. They were people that could work with something that they had potentially never seen before, analyze it in a way an engineer might to get it down to its constituent parts, and then innovate on top of that. But I think you're right, the other big component, at least these days, is the availability of just so much data, and our need to do something with that data in real time with limited resources is a natural entrée into where machine learning comes in. Lukas: From your perspective, what do you feel like the big interesting questions are right now in astronomy? What do you feel like you might learn in the next, I don't know, a couple decades that would really change this field? Josh: Well, it's all over the place. First of all, one way to think about astronomy is as a great laboratory for physics. So if we start there, and I think it's maybe somewhat apocryphal but Einstein really didn't like astronomers, but it turns out most tests of general relativity happen in the astronomy context. There are some terrestrial ways in which we can test GR, but most of the really interesting tests of GR these days, as has been true for 100 years, come from looking at the skies. Specific events and specific large-scale structure of the skies give us clues into some of the very basics of how the universe works at not just global scale, but at a microscopic scale. So we're also testing our understanding of how atoms work and understanding even what's going on inside of the nucleus of atoms by looking at what happens on extremely large scales, which is just mind-blowing to think about. So, if we think about astronomy as that laboratory for physics, another way to ask that question would be ""What are the really important physics questions that we have?"" One is, ""What is the nature of matter at extremely high densities and temperatures beyond nuclear density?"" We have objects like neutron stars, which are extremely compact stars that have roughly the same mass as our sun, but are the size of San Francisco. So that density, we can't reproduce that in the lab. We need to look at how those stars behave when matter impinges upon them or just even what their static distributions are in radius and mass, to learn something about what's happening with nuclear matter at those really high densities. We don't know whether general relativity is right.
It looks like it's really, really good on a lot of different scales, a lot of different mass scales and a lot of different length scales. But we're constantly testing this hypothesis that is general relativity, of whether it is a perfect description of how matter moves in the universe and how the universe is shaped by matter. We know it can't be perfect because it breaks down at the quantum mechanical scale, and there are things that happen in astronomy that allow us to test some fundamental precepts and hypotheses that come out of general relativity. In the gravitational wave world, which is essentially the ripples of space-time due to the changing locations of matter around other pieces of matter, we've had massive breakthroughs in just the last couple of years observationally, where we've seen the inspirals of black holes, and potentially the inspiral of neutron stars, smashing into each other. In the last few seconds, there is a huge burst of gravitational wave energy which we can detect on Earth, but we can also now start to see glimmers of the idea that we can start testing some basic ideas of general relativity in those last even milliseconds. So as instrumentation gets better there, I suspect our understanding of where GR is working and where it potentially breaks down will become really interesting. We're also interested in, at cosmological scales, in understanding the expansion history of the universe, the origin of the universe. Why did the universe appear to inflate and exponentiate so rapidly in just...less than a millisecond, 10⁻⁴³ seconds? Why did it grow so quickly? We know it had to based on observations at later times. What's absolutely remarkable right now is that when we look at the constituent parts that drive the dynamics of how the universe we think changes — as in how fast it grows and how fast it appears to be accelerating in its growth — ordinary matter is, as I'm sure you and your listeners know very well, makes up only a few percent of that recipe. Dark matter is a quarter of it and then dark energy is the other part of it. We really don't know anything about dark energy. We don't know whether it's a particle. We don't know whether it's something even deeper than that. We don't know whether dark matter is a particle on a tiny scale that isn't predicted by the standard theory or whether it's large clumps of black holes that were left over after the primordial expansion of the universe. So the biggest breakthroughs may come in a deep and fundamental understanding of what are those constituent parts. It may also come with a recognition that the framework that we have for understanding how the universe unfolds is right now fundamentally wrong and we'll look back on this in a couple decades and say, ""Boy, we were only looking at just part of the elephant, and now when we have a bigger picture of it, things become more clear."" There's more obviously. The last thing I'll just say because I'd be remiss not to, is understanding the origins of life and the prevalence of planets that can sustain life outside of our solar system. There is a huge push, both at Berkeley where I am and then across the world, in building new instrumentation and new theory that helps us understand how planets evolve, where habitable planets could be around sun-like stars, and how we're actually going to find them, characterize them, and potentially even understand what potentially primitive forms of life there are in those atmospheres. 
Lukas: So I have a feeling this is probably an annoying question, but it comes up a lot when I talk to ML people just in casual conversation who don't really know but astronomy, so I'll just ask it because I hear it a lot and I'm kind of curious. When I hear about dark energy and dark matter, I wonder do you really...is that just sort of like a fudge factor that shows that we don't really understand what the physical laws of the universe are? Is there a reason to call it matter and energy? Is there some sense that you're sure that it is matter? Josh: In some sense, there are kind of two fudge factors. Fudge factor A which we'll call dark matter and fudge factor B which we'll call dark energy. Dark matter is much better understood in how it behaves than dark energy. There's a lot of evidence that this stuff actually exists. I won't go into all the details here, but on many different scales, we have observational evidence that shows that while there are some people in the theory world that feel like they can explain away some pieces of that evidence, there is no successful alternative theory for explaining away this fudge factor with just sort of a different way of thinking about the universe. It looks like it's actually stuff. We know it interacts gravitationally and we hope that it interacts weakly in other ways. There are lots of endeavors actually looking for dark matter within a lab or within a cave and there's some other ideas of how astronomers could actually find the details of how dark matter interacts with itself, maybe with ordinary matter. So yes, it's a fudge factor in some sense to explain the overall evolution of the universe, but it was originally discovered to explain the anomalously fast motions of galaxies and clusters of galaxies. You just sort of add up the total mass associated with the light of galaxies, because we know how to roughly map the light of a star to its rough mass and the distribution thereof. There just wasn't enough mass, so there was this missing mass that's associated with galaxies. It turns out there's also missing mass associated with our own galaxy. We've been able to systematically rule out ordinary matter like electrons and protons, but I think the best bet is that it is some other series of particles that we haven't yet envisioned, but one day we may be able to find. On the dark energy side- Lukas: Do you know the distribution on a scale of a solar system? Can you tell where it would be from gravitational effects? It sounds like it follows a similar distribution to matter we can observe. Josh: That's right. Actually, we think within our own galaxy, the dark matter, which is all around us, either as, essentially, a fluid — there's particles of dark matter running through you all the time — or in extreme clumps in the form of primordial black holes. That's the other extreme that has the mass of a comet or a mountain, there'd be dozens of those flying through our solar system. There are potential ways in which we could actually discover these dark clumps and we have a whole series of observations looking for the particle version of that. It behaves a lot like ordinary matter, but in our own galaxy, while gas and stars — at least in the local solar neighborhood — are moving of order something like 200 kilometers a second, around the galaxy we think that there is a fluid, or these large clumps of it, which are moving in slightly different ways than the ordinary matter. 
So the ordinary matter and the dark matter, by definition of the gravitational interactions, actually do talk to each other and they do influence each other. But because the dark matter is sort of non-compressive and unlike gas, when you smash it together you get heat, this stuff, this fluid, sloshes around back and forth. I don't know the way in which we'd be able to detect the amount of dark matter that we think must be, let's say, in the sun because there's almost certainly some amount of dark matter that's been captured in the sun. It's such a small fraction compared to ordinary matter around us that...There are plenty of ways in which it could be hiding in plain sight. On the dark energy side, that is very much more of a fudge factor to explain the dynamics of how the universe expands. In fact, again, going back to Einstein, when he was working out some of the dynamics of the universe, he had this thing that he called his biggest blunder, which was coming up with this fudge factor constant to make the left-hand side and the right hand side of the equation work. Then when it was found in the '30s and '40s and '50s that there wasn't any of this accelerating expansion, he thought it was a big blunder, but ironically we actually needed that fudge factor. What's interesting is that we have that as a constant. It's got a constant amount of energy per unit volume, that's the simplest way to think about it. But we also don't know whether it's constant in time. It could actually be changing its constant. So there could be a temporal dynamic. Lukas: Wouldn't you see that in different rates of expansion? Josh: Yeah. So there would be different rates. There's already different rates of expansion just because in the early history of the universe, ordinary matter and dark matter dominated the expansion, as in it was sort of slowing up. As the universe became more tenuous and this material basically lost its dominance in these equations, there was a time several billion years ago when dark energy sort of took over and is now the thing driving the dynamics. If dark energy is a constant and we measured it well enough, then the universe will just continue to exponentiate and just grow, and will be this big...it won't be exactly an evaporation, but it will be called the Big Rip¹. It will basically all just rip apart from each other, and cosmology in the next hundred billion years will stop being about the observations of 40 billion galaxies and turn into just observations of stuff in our solar neighborhood because all the other galaxies will run far away from us. But we don't know that that's the case. It could actually...that constant could turn off for some reason. It could have other terms that haven't yet expressed themselves. Lukas: Cool. Well, I have to also ask you about the other topic that you brought up on finding so many habitable or seemingly possibly habitable planets in the universe, at least that we can see. Do you have any kind of thoughts on that? Are there theories why we don't see life on... If there's so many planets out there, why we don't see other life? Josh: Well, I think we know now that life, at least intelligent life, is not teeming, right? Enrico Fermi had the Fermi Paradox, ""If the idea of if life is so ubiquitous, why aren't they all around?"" It's pretty clear that it's not as ubiquitous as ""Every solar system has intelligent life"", that's not a big surprise. 
What we haven't yet nailed down in the overall demographics is ""What is the exact set of conditions that could give rise to any sort of life?"" We have a reasonable understanding now that of order...one solar system around a sun-like star will have of order one habitable planet. Maybe it's two or maybe it's a half, but it's not zero, and it's not ten. Then getting into the actual chemistry of what leads to — and biology of what leads to — life that's sustainable, that's really kind of where the cutting edge questions are on the theory side and obviously we have some great laboratories in our own solar system to ask those questions — form of atmospheres of other planets — and we're just now entering an era where we have sensitive enough equipment to be able to measure detailed chemical properties of atmospheres of other planets and other solar systems. What I think will become clear over the next, let's say, two decades is exactly what the rate is of planets that are in habitable zones around their sun-like stars that appear to be in some sort of disequilibrium when you look at the overall chemistry and the temperature profile of those atmospheres. How is it that we have something that is a volatile element that is still around? It means that there something else on the surface that's producing it. It won't guarantee that it's life. The question about finding other intelligent life that we could potentially interact with in some sense is beyond the horizon of modern astronomy, but there are groups, as you know, that are using modern astronomy tools to do those sorts of searches. Lukas: When you say disequilibrium, is that something that we would notice about Earth if we were far away from it and looking at it? Josh: Yeah. It's a little bit outside of my field, but if you took a spectrum of the Earth's atmosphere...and people have done this by looking at the Earth shine. So, right around the time of the crescent moon as it's setting right after the sunset, you often can see the un-illuminated part of the moon and that's because what you're seeing is the sun's light reflecting off of the Earth's atmosphere, bouncing off the moon, and coming back to your eyes. You can take a spectrum of that earthshine and there are signatures in that that if we saw that in other planets, we would say, ""Aha, there is something that..."" — and I don't know the details of which element does what — ""...that is not in an equilibrium given the temperature of the Earth."" Lukas: Oh cool, interesting. It seems like astronomy has had so many advances in my lifetime, which is so cool. Do you think that that's mostly due to better equipment to see more or do you think it's like better use or figuring things out from what we're seeing? I guess it must be both, but some of the astronomy experiments that I read about just seem totally brilliant of synthesizing...it seems like we could get like one snapshot of the world around us, and it's incredible to me how much physics, or how many things we discover from just looking up into space. Josh: I think a big part of that is indeed the 20th century was the opening of our eyes beyond visible wavelengths. X-ray astronomy really only started in the '50s, gamma ray astronomy around that time as well. Once we get above the Earth's atmosphere, which absorbs a lot of the light thankfully, at other wavebands we just see a whole universe that we either didn't imagine or only had a vague idea that could be out there. 
So a big part of the 20th century was just opening up our eyes to new wavebands and understanding the connection between different objects and events like supernovae, how they are connected to each other across different wavebands, and what their role is in driving the dynamics of a specific galaxy and what the role is in the creation of elements. We didn't really even know how to ask those questions properly, I think, until the last several decades. So part of it's that, and that opening of the eyes is driven a lot by technology. But then it's, ""Okay, well, I have my eyes open, but they're blurry. So how do I sharpen them?"" There are plenty of examples here, again going back to the conversation at the beginning around co-opting tools. Astronomers learned about adaptive optics being used for military purposes and were able to get much clearer images of the sky because we're now pointing lasers up into the atmosphere and exciting a sodium layer high up in the atmosphere, which acts as a temporary star. We have corrective lenses that at many, many times a second are sort of correcting the wavefront errors that come as the star's light comes through the atmosphere and gets blurred. So we have all that kind of technology. Of course, digital technology starting in the early 1980s meant we were taking very high-dynamic range images of the same parts of the sky we were looking at before. So we were able to see farther away, see fainter objects, and at the same time there were a number of innovations in the telescope world even on the ground that allowed us to build bigger and bigger telescopes. In the end, we haven't gone that far from Galileo's telescope to the world's largest telescopes, 10-meter class telescopes today. That's just bigger and bigger, collecting light. But the innovations that it's taken for us to get there have been real, and have been driven by the need for seeing fainter objects, seeing with greater clarity, seeing across more wavebands. Some of the biggest discoveries in some sense happened outside of the electromagnetic band, over the last many years. The observations of very high-energy cosmic rays — so these very high-energy charged particles moving very close to the speed of light, understanding the origins of those is still an active topic — and the discovery of gravitational waves directly using interferometers on the ground is a massive innovation that took arguably 40 years for us to get there technologically and several billion dollars of taxpayer money that went into that. It took a large number of people to be very, very convinced that the physics was right and they'd be able to get there. So the fact that they were is...one of the great crowning achievements of our field is a recognition that, driven by theory, we were able to invest billions of dollars to get to a set of discoveries that we could have only dreamed of 10 years ago. Lukas: Do you think that the gravitational wave sensor was more of an engineering feat? Because it just seems so incredible to be able to sense something so small, or was it more of a theo-...what was the hardest part of that? Josh: Well, in the early days — that predated me — theorists were in active discussion about whether you could even use these circle interferometers with lasers to look for this deformation.
Once people became convinced that the theory was right...you're exactly right, this became an engineering feat, which — to maybe more of an interest of your listeners — is about project management and about people management, and bringing the right people to the table with the right skill set. And recognizing, of course, that the entire endeavor doesn't need to be one big innovation, right? There are places where you absolutely need to innovate and create new things that don't exist for you to get to your goal. But to do this essentially on time and on budget on a 20-, 30-year time scale is just mind-boggling. Lukas: Is the achievement of that just sort of verifying that gravitational waves exist or do we have kind of a new type of sensor that might somehow find interesting stuff in our world? Josh: Well, without trying to prejudice where things are going, I will say that the history of astronomy — in that context of opening up your eyes to new things — invariably leads to discoveries that were unexpected. So far, I'd say the only large unexpected thing that's come out of the gravitational wave set of observations is the sheer mass, the enormity of the individual black holes that are colliding. There wasn't a lot of great motivation to say that we'd be seeing 100- and 200-solar-mass black holes that were colliding into each other. In some sense, it comes back to the astrophysicist to ask the question, ""How do you even make 100-solar-mass black holes and then put them in the vicinity of another 100-solar-mass black hole?"" We were thinking in the end it would be 10-solar-mass and 20-solar-mass black holes, that was the best bet if you asked most astronomers. So there's a little bit of surprise on that. None of us were surprised of the existence of gravitational waves. There had actually been Nobel prizes given out for the indirect discovery of the existence of gravitational radiation by looking at the orbit decay of neutron stars in a binary system. We had known that the...it was very likely that this existed, but the direct detection of that was a very beautiful vindication. Now that we're there and we're having to grapple with understanding the demographics of the black hole population, a real interesting question is how, as I was saying earlier, how can we use our observations going forward to test general ideas about general relativity? Lukas: When I was a kid, I remember learning about the Hubble Telescope and the excitement around... I mean, in general putting telescopes into space was this big exciting project that seemed really cool. Have we gotten so good at signal processing or undoing the effects of the atmosphere that that's no longer such an important thing to do to get our telescopes up in space? It also seems like when I was a kid I had this sense that telescopes were getting bigger and bigger and we were seeing more and more things, but has that maybe stopped? Do we still aspire to make even more gigantic telescopes to see deeper in space? Where do you think that's going? Josh: It's a great question. It depends on who you ask. There isn't a general consensus of the right answer for that, which is good because the right answer is you do what the science demands. 
There is a very successful satellite that was launched into space called the TESS satellite², whose sole purpose was to look for Earth-mass planets around sun-like stars and to find those just using the so-called transit technique, where one star...where one planet moves in front of its parent star and that star slightly dims. To do that, you need to see the dimming of a star in one part in 10⁵ or one part in 10⁶, which you just can't do from the ground. There's just too much atmospheric flickering that you just can't correct. We can get down to one part in 10³ or maybe one part in 10⁴ from the ground, but that's pretty much as far as we're able to go. So for finding exoplanets of Earth-mass size, you pretty much have to go into space. Rather than build a huge telescope, what they did is mount the equivalent of a bunch of glorified cannons and a bunch of glorified iPhones to look at a very large swath of the space so they could study many, many stars simultaneously. There, they weren't all that interested in looking at stars that were faint because once you discover one of these planets, you want to have lots and lots of photons with other follow-up facilities to actually do all the work. So they actually needed a very wide field to get very bright stars. But there are other people who are launching large satellites with large mirrors because they want to look at very faint explosions, supernovae in very, very distant galaxies. Yes, you can do that from the ground. It just turns out from a price perspective, there are some types of science that are actually easier and cheaper to do from space. My own interest depends upon, ""Can I do this from the ground? If not, what's the simplest and cheapest thing that we can do from space?"" One of the things I'm really excited about, which you may not be aware of, is there is quite a big and interesting push now towards smaller format satellites, i.e. CubeSats, in part because if you have a very dedicated science goal, and you need to look at, let's say, one object for just a month but you need to do that at one second cadence, that's really hard to do from the ground. But you could potentially do it from space very, very cheaply now because the actual parts are largely commoditized, and the launch — which is a very dominant cost for heavy space vehicles — is more or less zero because there's so many launch vehicles going up in space for all these different reasons. You can piggyback a whole bunch of these small satellites more or less for free. So, what I think you'll see in the next 10 years or so is a Renaissance not so much at the large-telescope level, but at the small-telescope level in space. And the last thing I'll just say is that we sometimes have to go to space because the Earth's atmosphere blocks certain wavelengths. So if we're interested, for instance, in the ultraviolet sky — in phenomena at the ultraviolet sky — because of our ozone layer, we'd block off most of the UV light. So we couldn't do anything from the ground. Lukas: Cool. Well, I guess I want to make sure I ask you some questions about machine learning also. I wanted to ask you about...so you have this group ML4Science³, right? I'm curious, what inspired you to put that together and what kinds of stuff you work on there? Josh: Well, it might be worth talking a little about how we stumbled upon machine learning in my research and where that's led to. About 12, 13 years ago, we were actually dealing with lots of images coming off of telescopes from the ground. 
The normal behavior when you get lots of data had been — and in many circles still is — to just hire more grad students to look at the data. I was looking for ways to scale our way out of what was a very repetitive inference task, which was the discovery of new events in the sky. What we typically deal with is a new image that's taken of the sky, and you have a template image of that same part of the sky taken in the months prior where you've stacked up a whole bunch of really good images, and you subtract the two off. That subtraction process is imperfect because of the atmosphere, because of instrumentation effects. What people would do is look at postage stamps around all the five-sigma positive signals, but most of those are actually spurious. The first place where we landed in the utility of machine learning for my own research was creating what we call a real-bogus detector where we trained off of good subtractions, i.e. of real objects, and bad ones caused by all these different detector effects and instrument effects. We were able to build something with good enough false positive and false negative rates that we were able to put that into production and reduce the amount of time it would take a person to look at a whole night's worth of candidates from hours down to minutes, and still keep a person in the loop. At the time, I had the conceit that if we could do this, it meant we wouldn't need people to look at the follow-up data. We can actually just get to the point of almost writing a paper without any people in the loop. But as you know well from your current work in your previous company, people in the real-time loop is still important and can be very important even when it's machine learning-assisted. So, that was very successful in that...and that was back in the old days of random forests, before deep learning kind of had its Renaissance. Now, this idea of real-bogus discovery is...it happens pretty much in every project going way beyond where we were a while ago, now using modern deep learning techniques. Lukas: Before you go further...in my previous work, I always admired the site Galaxy Zoo⁴ where they kind of got lots of people to crowdsource some of the labeling of these images. Did you look at that at all? That always seemed like such a cool project. Josh: I did look at that. Yeah, I did look at that. I think crowdsourcing in astronomy has been really wonderful as an outreach tool and there certainly have been some scientific papers that have come out of that. In particular, there was the discovery of a weird class of gas around certain types of galaxies that was made by somebody looking at images of galaxies⁵. But a lot of the labeling, if I'm being really honest, by people in the Galaxy Zoo world could have been done and ought to have been done by a machine learning classifier. Is this a spiral galaxy? Is this a red galaxy? The questions that generally are asked in that world...I've done this in classes that I've taught: we have a student, for a final project, try to reproduce the ROC curves of people doing the classification, and they can do well. We actually showed, for the supernova classification challenge, that we were able to build a machine learning classifier off of the original training data from Galaxy Zoo and outperformed Galaxy Zoo in a false positive, false negative sense.
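To make the real-bogus workflow concrete, here is a minimal sketch in scikit-learn. It assumes you have already extracted a feature vector for every candidate detection from the difference-image postage stamps; the feature values, labels, sizes, and threshold below are all placeholders, not the production pipeline.

```python
# Minimal real/bogus-style classifier sketch; random stand-in data, not real survey features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))    # stand-in features (e.g. FWHM, ellipticity, flux ratios, ...)
y = rng.integers(0, 2, size=5000)  # 1 = real transient, 0 = subtraction artifact

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# Rank candidates by score so a human only has to vet the most promising ones.
scores = clf.predict_proba(X_test)[:, 1]
print(classification_report(y_test, scores > 0.5))
```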
One of the challenges that I think all of us have in employing people to do repetitive inference tasks is to ask ourselves the hard question of ""Can I have a machine do it?"" If the goal is to involve people so that they're involved in research and they're helping, that's fantastic. If the goal is to get people looking at data because maybe they'll also see something and answer a question that we didn't even ask, that's fantastic as well. But for the specific task that a lot of crowdsourcing questions have asked, I think especially with where computer vision has arisen, we're able to do that better. Moreover we're able to do it faster and moreover we're able to do it in a repeatable way. So one of the other challenges, of course, if you ask somebody to label a bunch of data and then you ask them to come back tomorrow after they've had a beer and label the same data, you'll get a different answer. From understanding the demographics of everything we see, I think it becomes a lot harder when you have people that are deeply part of that process. Lukas: Got you. So, I cut you off though. I mean, you were doing this quite a while ago and especially vision techniques, I think, have massively improved. I don't know even ""especially vision techniques"", but there's this moment where vision got quite a lot better. Did that affect the way you used machine learning in your work? Josh: Yeah. We always are careful in a sense to try not to look around in astronomy and say, ""That's a computer vision task. That's clearly solvable now by CNNs, so let's go work on that problem."" There is a little bit of ""everything looks like a nail because we have this really cool hammer"". That was a computer vision task, this real bogus detector that we had to solve if we were to break this grad student bottleneck. There are plenty of tasks that people are doing asking questions of images that were around before, but perhaps weren't as interesting because we had no way of solving those problems and now we can do those at scale. I actually focus less on images now and focus more on irregular time series data. But I think one of the important things to recognize about where astronomy is, maybe relative to many of the other fields of the people you've talked about on this podcast, is that we haven't had that moment that maybe that existed in NLP where Jelinek said ""Every time I fire a linguist, my language detector gets better."" The idea that if you start removing domain experts out of the loop and you actually start building language models, just learning off of data, and it gets better and better — we don't have that moment in astronomy. Computer vision is the same thing too. You fire a bunch of old-school computer vision experts that learned about Hough transforms and stuff, and now you just throw it into a big CNN with lots of training data, and you get better answers than what you were ever able to do in the past. That hasn't happened in astronomy. We've used ML in lots of different places as accelerants and as surrogates to harder computations so we can get faster answers. We can do inference at scale in ways we were unable to do before. But it's the same thing in biology, right? ML didn't invent CRISPR and Katalin Karikó — who was this person who toiled away for decades trying to understand how mRNA could lead to a vaccine — she was actively denied tenure and actively denied grants. She had nothing to do with ML, but if it wasn't for her we wouldn't have vaccines for COVID. 
Biology, I think, also hasn't had its ML moment where you can start firing domain experts and start doing things. Right now, physics, astronomy, chemistry, for the most part we're working in a world where machine learning is this really big important tool in our toolbox but it's not become the fundamental driver of how new insights happen. Lukas: I guess one key difference here though is the work product of astronomy is kind of explaining the world that we live in, right? Whereas...first of all, I'm not sure if I agree with the comment about linguists. I don't want to go on record — Josh: It's not my quote. Lukas: No, I know, I know what you mean. I think definitely linguists still do the best job of explaining language in the... I mean, I think linguists would probably say we use ML techniques to understand language better, not like we've replaced ourselves...although, it seems like modern translation techniques are less informed by linguistics than I might have expected when I was younger. I wonder if it's like a function of a domain being more ""trying to engineer a certain solution to a specific problem"" versus ""do some kind of explanation"". I mean, we've actually talked to a whole bunch of biologists and it does sort of seem like some of the processes around drug discovery are starting to be more and more informed by ML, and moving in that direction. Josh: I think that's where we exactly are right now, is there is a huge amount of ML that is informing astronomy, but I don't think we're there anywhere near where the NLP world is. In part it's because we haven't...to your point, we haven't really been able to articulate a set of outcomes that are comparable or have as much weight or as much import as an NLP task like translation. You can directly correlate, I assume, the quality of a translation from language A to language B to some dollar outcome. And in astronomy, we don't have the ability to do that. So our loss function is a little bit more complicated. As we're learning these various different tasks as part of our workflow, we don't have the ability in the same way many other fields do to articulate that loss function in terms that have this monetary value. When you ask this question about ""What is the nature of dark energy or dark matter?"" or ""How many exoplanets are out there that host life?"", those are in some sense quantifiable answers. But as you're saying, that's sort of where more of the explainability has to come in. I certainly don't think we're even trying to get to the point where we fire a physicist so that I can hire a computer scientist. It's going to be the marriage of those two people, or as an individual and their skill sets, who are going to make a lot of progress. I think the really exciting place where we could get to — and there are little tiny pieces of this starting to happen — is whether an application of ML to a bunch of data can be something that leads to a discovery on a bunch of questions that we didn't even know how to ask. That would be a real hallmark moment in our field. Right now everything is done in largely a supervised context. Obviously, we've sort of had some semi-supervised and unsupervised ways of looking for anomalies and outliers, and things like that. But even that, it becomes a guide to a domain scientist looking at this and say, ""Oh, yeah. 
Of course, I know what these things are"", or ""This is because the data is spurious."" Maybe what's really fundamental, if I think about it, is that the job of these ML pipelines that we build on different parts of our data isn't so much about prediction in the same way that if I need to predict what the next word is or I need to predict if this is a cat or a dog or what the best thing to show somebody is next, that is the proof in the pudding and you've done well because you can measure what the outcomes are after that. If I make a prediction in astronomy, that's really just for hypothesis testing. If I have a new theory that's gleaned off of data, the job of that theory is to make a prediction about what happens if I observe outside of the domain of the data that I already have, to falsify itself. We haven't really wrapped our heads around the idea that ML in the context of the physical sciences isn't just about making predictions at scale so that we can get slightly better data farther down the work chain. If it's going to actually drive our deeper understanding of how the universe works, it has to couch itself in terms of hypothesis testing and Occam's razor. We haven't really gotten there yet. Lukas: I'm so surprised to hear you say that because it seems like we fund all this work to make better devices and telescopes, and it seems like they pay us back in terms of these really awesome new understandings about the physical world. It seems like you make a bigger telescope, that's just seeing things slightly better however you put it, right? Isn't it similar? If ML can help you get better data to inform your predictions, wouldn't that be a big deal? Does it really need to... Do you really need to be like completely replaced by ML for it to be...? Josh: No, I certainly don't want to come off as trying to make the argument that ML hasn't been important. We're currently working on a project where a big part of that whole chain of data planning, data taking, data reduction, initial discovery, initial inference, and initial follow-up all happens without people in the loop now. There's little pieces of ML through that entire chain, which is absolutely incredible. Telescopes more or less talking to telescopes, mediated by ML. This is where we are. There's only going to be more of that going on. What we're doing is we're optimizing our resources and our resource allocation because we're using ML. But I still see that as fundamentally an accelerant and a surrogate to what we were pretty much doing in the past. I haven't seen anything that fundamentally changes the way we conduct ourselves as physicists. But again, as I said, there are little pieces starting to show up, [like] the rediscovery of the Higgs boson using pure ML without reference to the basic physics of how particles interact⁶. Lukas: Do you think that would've worked without knowing about it? I mean, is that- Josh: That's the question, right? Until we get to the point where someone says, ""I ran my ML pipeline on this particle physics data and I saw this new thing. And everyone in the group didn't believe me until they got 10 times more data and it turns out it was there."" We haven't really, really gotten there yet. There's been a few places where people have found another exoplanet in a complicated data set that people hadn't seen before. But astronomers for the most part are still Bayesians, and we're still governed by Bayes' rule where we come to our problem with a bunch of priors.
We get data that updates our beliefs and we get slightly better, or sometimes much better, understandings in our posteriors. If we talk about inference and understanding, we need to couch it in terms of what we think are the physical properties and the physical things and the parameters that describe the object that we're looking at. We're getting better at that. One of the big things I'm doing in ML right now is trying to use different types of networks in a whole new class of approaches called likelihood-free inference to go directly from raw observations to posteriors or approximate posteriors. I think that's extremely exciting and can be applied to a whole bunch of different places. Lukas: Cool. That's so cool. One thing that I wonder about...it must be interesting being in your shoes. Are you doing most of this with ML grad students? Are you at the point with your data pipelines where you need to pull experts from industry or maybe... I mean, it's so funny how much of our data pipeline stuff has come originally from astronomy and sciences in general, so maybe it's always at the cutting edge, or do you feel like you need to get experts in terms of just handling these volumes of data sets and building these gigantic models? Josh: It's a great question. So again, I think the answer depends. There are lots of examples in my own research and my own work where hiring very good data engineers, and having some ML expertise on the team, suffices. It's where you actually need to innovate, create new algorithms, take some existing network, and completely blow it up and change the way that it works, that you do need somebody with deep domain background in CS and ML. One of the beautiful things about being on the Berkeley campus is how just everyone across campus is looking to work with each other because, again, we all recognize, at least from the physical domain side, that there is incredible work that's happening in the Computer Science department, the Stats Department. I've just become a member of the faculty of BAIR⁷, the Berkeley AI Research group, so I get to interact with those grad students and those postdocs. We still, I think, face the challenge that any academic arena does when crossing over into other fields of trying to make the kinds of problems we have compelling for the other side, and have the other side recognize that they're not just setting up a Spark cluster for us, and downloading ResNet. What the people in computer science and stats need to realize is that we are asking questions of data in a way that they are not — of the benchmark data sets that they're often working on algorithmically — and because of that there needs to be some real fundamental innovation. I've been really fortunate in my career to have gotten grants that have allowed me to hire people outside of the traditional astronomy background. I hired PhDs in computer science and statistics, and that's where some of the most interesting innovations happen. Where we're at, I'd say now as a profession, is really struggling with the idea of how much our students should have to learn for them to be able to work on this as their main endeavor. We don't have as part of our curriculum a deep training in stats, let alone ML, let alone software engineering. I don't know where they find the time, and where they will find the time going forward, to be able to get all of that in at a fundamental level. We're working on it, we're trying.
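As a toy illustration of the likelihood-free inference idea Josh mentions above (and not his actual method), one can train a network on simulator output so that it maps a raw observation directly to an approximate Gaussian posterior over a parameter. The simulator, sizes, and training details below are all invented for the sketch.

```python
# Toy amortized, likelihood-free inference sketch: everything here is a made-up example.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(1)

def simulate(theta, n_points=50):
    """Hypothetical forward model: a noisy damped sinusoid with decay rate theta."""
    t = np.linspace(0.0, 5.0, n_points)
    signal = np.exp(-theta[:, None] * t) * np.sin(2 * np.pi * t)
    return signal + 0.05 * rng.normal(size=signal.shape)

theta = rng.uniform(0.1, 2.0, size=20_000)  # parameters drawn from the prior
x = simulate(theta)                         # simulated observations

obs_in = keras.Input(shape=(x.shape[1],))
h = layers.Dense(64, activation="relu")(obs_in)
out = layers.Concatenate()([layers.Dense(1)(h), layers.Dense(1)(h)])  # posterior mean, log-std
model = keras.Model(obs_in, out)

def gaussian_nll(y_true, y_pred):
    # Negative log-likelihood of the true parameter under the predicted Gaussian posterior.
    y_true = tf.reshape(y_true, (-1, 1))
    mu, log_sigma = y_pred[:, :1], y_pred[:, 1:]
    return tf.reduce_mean(0.5 * tf.square((y_true - mu) / tf.exp(log_sigma)) + log_sigma)

model.compile(optimizer="adam", loss=gaussian_nll)
model.fit(x, theta, epochs=5, batch_size=256, verbose=0)

# For a new observation, the network returns an approximate posterior directly.
mu_post, log_sigma_post = model.predict(simulate(np.array([0.7])), verbose=0)[0]
print(mu_post, np.exp(log_sigma_post))
```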
Berkeley has started a new data science major that the astronomy department is connecting up into with their own classes. But there isn't, at the national level, a holistic understanding of how we're going to do training of the next generation of physical scientists, so they're not just conversant in ML but they can actually do a bunch themselves. Lukas: Actually, the question I wanted to ask you — which, this is a good segue — when I was looking at your website, I found hundreds of research papers, but also mixed in some opinionated blog-like posts on programming language details. I'm actually wondering for you, and maybe something I'm asking myself, how do you stay current on... I mean, how did you even find time to get to a high level of programming at all and how do you stay on top of that? Are you spending time writing code yourself as a professor? Josh: Yeah, I write a lot of code. That's some of my happiest times. In some sense, that's my hobby. I came to programming early on in my academic career when I was an undergraduate, where I was basically told by my future advisor at Los Alamos I can't work there for the summer unless I've taken a class in C, and I did. That was more or less the only class I ever took in computer science. Again, it was this matter of necessity, just like it is with building better telescopes. I decided, when I was a postdoc, to automate an old mothballed telescope — which was a fairly large one-meter class telescope in Arizona — and take all the pieces that had been manual when the telescope had been run before and automate every single piece of it so it could run autonomously. I asked a friend of mine at Los Alamos, ""Which language should I write in?"" And he said ""Python."" I said, ""What's that?"" He said, ""Just do it. It's a cool language."" Lukas: Wait. What year was this? Josh: That was 2002. Lukas: Oh, wow. Josh: I wrote a whole telescope automation software package using state machines and connecting up to device drivers in C++ in 2002, where I was just kind of feeling my way through it. I think I wrote my own datetime module and I didn't realize that datetime was there. So, I just stumbled upon it. And then what you do when you're an academic and you wind up realizing something is interesting, is that you feel bad that you're not teaching it to your students, so you do. So I started in 2008 a bunch of Python boot camps on campus to get people into Python, in part because especially at Berkeley, we caught the open source ethos pretty early and the kinds of languages that people around me were using — like IDL, the Interactive Data Language, and MATLAB — were just expensive. Moreover, as scientists, we certainly want to understand the algorithms that we apply and we want to at least be able to look under the hood if we need to. I started evangelizing Python around these parts and started building classes on top of Python. So, a graduate level seminar on ""how do you actually use Python in the real world"", ranging everywhere from doing stats to scaling up Python programs to testing frameworks to interacting with hardware. That class still goes on, but I've got to say I've ossified a little bit around Python. I've spent a little bit of time with a few other languages, but for me I've become conversant enough and gotten fairly deep into this scientific Python community. Jupyter, for instance, with Fernando Perez here in the Stats Department, has really been a huge part of what I've used for teaching for a long time.
The numpy and the SciPy stack have a lot of activity here on campus as well. Stefan van der Walt has a huge role in that. So, it's sort of in the water, I'd say. Definitely, the proof is in the pudding, having recognized that Python is extremely versatile as a sort of superglue language for all the kinds of stuff that we do. Yes, I still code. Last summer during the pandemic, the happiest times were me learning React, so I could build this large-scale React app that we're doing for astronomers to interact over data. Lukas: Isn't React fun? React is so much fun for me, I thought I hated frontend. Josh: I don't know if I would use the word fun. What I love about it, it's just so wonderfully different than the way you think about Python programming. And obviously, it's rewarding in a sense that you build it, you ship it, and users see it right away, in a way that if you build some cool Python tool, you may be the only one in the world that uses it. Just because it's on PyPI, it doesn't mean that somebody is actually going to download it and use it. Lukas: Well, did you use TypeScript with React? Josh: No. Lukas: No? Josh: We were like JSX kind of... Lukas: I see, cool. Maybe I just like React because I think I was writing a lot of frontend stuff around 2008, and found it frustrating, and then went back to it a few years ago, and just was impressed by how much things had evolved in the decade since. Josh: I love React, but I don't like testing React apps. Lukas: I was trying to do some typing stuff recently, actually it was with your student Danny, and I was really wishing that Python's typing worked a little bit more elegantly, especially in the scientific computing domain. I felt like when I was doing research briefly, the code bases were truly messy in a way I've never experienced in industry — this may be a long time ago — do you do things in your lab to keep things maintainable? Maintainable, as like students come in now and they need to do various research projects. Are you able to find time to clean up code and eliminate tech debt and things like that? Josh: I think we're probably better than most, but never as good as I'd like to be is probably a reasonable answer. I'm at least aware of the existence of things like unit tests, unlike many of my colleagues in our field. Yeah, it is a mess. Again, it comes back down to loss functions and incentives, right? Lukas: Yeah, totally. Josh: When we write a grant, there is no imperative — as much as I think it'd be great — to say, ""By the way, one of the outcomes if you're writing code has to be that this is going on GitHub and that it's going to have CI/CD, like Travis, attached to it so that when pull requests come in, you know whether they're going to be working or not."" There's none of that at all. So if you do any of it, you're doing it out of the goodness of your heart at zeroth order, but as you know, at first order it's because you're doing it to help yourself in the future. Oftentimes, in a research context — and this gets back to a question you were asking about, ""Do you need to hire ML people to work with massive amounts of data?"" — what I was going to say is that not all of what we do is massive data. Astronomy has a lot of data, but we have only a small number of labels, for instance. It's a big data problem, but actually a tiny number of labels for the kinds of stuff that I'm interested in, or zero labels.
So how you do one-shot learning is a really interesting kind of problem in a physical context in the presence of noise and uncertainty and model uncertainty. There's lots of questions that we ask in the context of ML that are actually kind of small-ish data problems, or they're large computation problems because the forward model is extremely expensive and requires a supercomputer, but the amount of data we're dealing with is thumb drive-level. But because of that, we tend to atomize our activities around projects, around papers. I read a paper with a student, we figure out a cool new thing to do in the machine learning context, and unless that is going to be like a major new widget that gets plugged into some new facility or existing facility, then it's just out there in the world and people can write papers saying their scaling curve is better than my scaling curve and we can have an argument at a conference one day. That's sort of the end of that code base, right? Whereas, as you know, in the industry world, you're generally not writing code as a one-off and then just casting it aside. So the incentives there to keep things maintainable, keep things up to the latest versions of Python and blah, blah, blah, they just really aren't there for most of what we do. There is a subset of what we do where it absolutely has to be battle-tested because more and more people are going to be downloading it and using it. I tend to see those projects as extremely exciting, but there's not a lot of, I'd say, astronomers who have the experience with full CI/CD pipelines and production DevOps that I've been lucky to have in my career. Lukas: Let me ask you this, what does your lab's tech stack look like? Are you using PyTorch? What's your standard tooling? Are you on Python 3.7? Josh: It's actually pretty agnostic as students have come in, because I tend to...students who are interested in ML tend to gravitate towards me and I naturally gravitate towards them. That's how, I guess, gravity works. Lukas: Sure. Josh: I've been agnostic to whether it's TensorFlow-land or PyTorch-land. I think that's becoming less and less important as TensorFlow has evolved more towards the PyTorch way of thinking about the problem. If you said, ""Build me an ML thing right now,"" I'd probably start in Keras just based on my own past experience. But obviously I'm looking at code with PyTorch, and PyTorch Lightning I think is — from a teaching perspective — that's...the last time that I had to teach some ML, I was doing it in PyTorch Lightning. Although, I had a notebook in Keras and I reproduced the same thing in PyTorch Lightning. Of course, we had Weights & Biases there as well for monitoring. Lukas: Nice. That really warms my heart. Josh: I've been introducing a new cohort of people to your product. Lukas: Thank you. Josh: That's obviously top of the stack. It very much is a Pythonic world now. As I was saying before, in this other large project, which is called Sky Portal, that we're using as an interaction platform — where, now, hundreds of people are using it on a daily basis looking at real data as it's flowing in and interacting over individual objects — that tech stack is obviously more complicated. There is a component of it which is slightly external to the stuff that I built in my group — but is part of our project — which is more or less a large MongoDB engine that's dealing with terabytes of data, and there's a bunch of ML plug-ins to that that run in real time. And that's, I think, using TensorFlow. 
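(As a rough illustration of the setup Josh mentions above, PyTorch Lightning with Weights & Biases for monitoring, here is a minimal sketch. It is not code from Josh's group; the project name and the toy data are invented.)

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger

class TinyClassifier(pl.LightningModule):
    # A deliberately small model; stands in for whatever network you actually train.
    def __init__(self, n_features=16, n_classes=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, n_classes))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.net(x), y)
        self.log('train_loss', loss)  # forwarded to the W&B run by the logger below
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Fake features and labels standing in for real (e.g. light-curve) data.
X, y = torch.randn(512, 16), torch.randint(0, 3, (512,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# 'demo-astro-ml' is a made-up project name; requires `wandb login` beforehand.
trainer = pl.Trainer(max_epochs=3, logger=WandbLogger(project='demo-astro-ml'))
trainer.fit(TinyClassifier(), loader)
```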
Then what we've built is essentially a Tornado-based, API-first backend and it attaches to a really large Postgres database, and on the frontend is React. Lukas: Cool. Well, we're almost out of time, maybe we're even overtime, but we always end with two questions that I'd love to ask you. The second-to-last is, basically, is there a topic in ML that you think should be studied more than it is? Is there something that you would look into if you had extra time on your hands? Josh: Yeah. I mean, there are a lot of things I wish I had more time for. Where I think there needs to be more work in the ML world is around UQ, uncertainty quantification. Astronomy and physics work in a world that's sensor-based, fundamentally, in terms of our observations. Because it's sensor-based, there's noise. So, unlike in the AlphaGo-Atari world where every pixel has a perfect measurement, if you take an image of the sky or you measure some time series, there's noise associated with it. And because there's noise and because there's a finite amount of training data, if you build models off of that, you get uncertainties in the models because of their lack of expressiveness or their overgeneralization or overfitting. Then, you also have a source of uncertainty in what it is that you're trying to understand, just because fundamentally you don't have a perfect measurement, your signal-to-noise is imperfect. I see some of that research, again, coming out of the ML world, but I see some of the stuff I'm most interested in as coming out of the physics-and-astronomy-meets-ML world. I'd love to see more of that more broadly. I think it's partly our fault as domain scientists for not coming up with the equivalent of grand challenges like with protein folding, where if we had this, we would be able to make great strides. We need to have not just benchmark data sets for other fields to be playing with, but we also need to be really clear about some of the important questions that we're asking. I think in the end — back to a lot of what we were talking about throughout this whole interview — doing inference and doing interpretability on the models that we build requires a fundamental understanding of the noise model of the data. And without that, nothing of what we do is going to be believable. Lukas: Interesting. I guess that's a good segue into my final question, which is, when you look at making the machine learning models actually work for you — like actually do something useful — what are the big challenges that you typically run into? Josh: Well, it is a good segue from the previous one, because we are struggling, I'd say, as a community with recognizing that there is this large algorithmic toolkit that has been developed in the computer vision / NLP world that we could just take, make a couple modifications to, and do what we're already doing better, faster, and at scale. And as I was sort of arguing through the middle part of the interview, that isn't where I think the biggest revolutions are going to come from, or at least I hope that's not where they come from if ML is going to wind up being involved. One of the harder problems is articulating what are the really hard problems in astronomy that can only be solved with new ML tools or new ML innovation. We're all working on it in different ways, we all have our different biases, I think we may wind up getting there. The other one is maybe more practical, which is that it is very hard to put machine learning into practice. 
It's easy to write a paper on machine learning and convince a referee that you're doing pretty well. Maybe release some code. Maybe have the referee kick the tires on that code. That's pretty much where we're at as a community. But, trying to get it into a real workflow that affects real people's lives on the other side of that, there's not a lot of us that have experience with it. No one's really trained to do it well. So most of the time when it's done, it's done in an ad hoc way, you know, leveraging some understanding of how software engineering is supposed to work, but as you know well, machine learning in production is a very different beast than ordinary software in production. I don't think as a community, we fully grasp how hard it is. The other side of that, of course, is that because machine learning is so exciting to so many, we're starting to train a number of students that have kind of just enough knowledge to be dangerous. But because again everything looks like a nail when you've got a new hammer, a lot of people, I think, are going off hitting nails that they ought not to be. One of the things that I always say when somebody says, ""What's the worst thing about machine learning?"", is I always say, ""It's because you always get an answer."" Especially in the context that we're looking at, if we always get an answer and we're getting data that's outside of our original domain or some notions of concept drift or something because the instrument is changing, we don't have any guardrails against that. Luckily, unlike in many of the fields that your listeners work in, if we make a mistake, people don't die, and we don't blow up billion-dollar facilities, and things like that. So we live in a little bit of a nice sandbox where the mistakes that we make may have implications for lack of good resource allocation. But we still could wind up making statements about how the universe works that is fundamentally wrong because we don't know enough about what's happening under the hood. Lukas: Josh, thank you so much for your time. I really appreciated it, that was super fun. Josh: This is great. Thank you, Lukas. Great questions. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce. So, check it out.",11409 +Xavier Amatriain — Building AI-powered Primary Care,https://www.youtube.com/watch?v=JBjt1X_zvvE,3014,2021-07-29,"Xavier: How do you connect the offline metrics that you have in anything you're doing in any model in the lab to what's going to be the real impact that that model has on your product? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host Lukas Biewald. Xavier Amatriain is co-founder and CTO of Curai, a ML-based primary care chat system that we're going to talk about today. Before that, he was VP of Engineering at a website called Quora, which I absolutely love. And before that, he ran the recommendation system at Netflix, which is especially famous for the Netflix Recommendation Prize¹. I could not be more excited to talk to him today. I want to start with talking about what you're working on now. I mean, you've had a really long and interesting career in ML, but it probably makes sense to talk about Curai, right? Is that- Xavier: Yeah, Curai. Lukas: Curai. 
First, can you tell me what Curai does before we get into how machine learning fits in? Xavier: Yeah. I mean the basic level is an end-to-end virtual primary care service. It provides everything that you could need from your primary care doctor, but it provides through an application, through chat. Our goal is to provide the best possible health care at the lowest possible price and make it very accessible and very affordable for everyone in the world, while at the same time increasing quality. The way to enable that, is using technology and more concretely, AI and machine learning. We feel like one of the things you can do through machine learning and AI is to automate and therefore make things more efficient, that's pretty obvious, but the other thing that might not be so obvious is that you can also make things higher quality, right? That's very much related to the notion of data-driven decision-making, algorithms, and science in general, which should be behind all the medical decisions. So, the combination of sort of quality accessibility is what drives our product. But again, our product is basically a virtual primary care service that is provided through an application and through a chat-based interaction. Lukas: And so, could I use it today? If I had a health issue, I could talk to a virtual- Xavier: Yeah. We're now available in seven states in the US. So that's, let me make sure I don't miss any, it's California, Florida, Illinois, Ohio, South Carolina and North Carolina. So those are the seven states. We plan on being available in the 50 states by the end of the summer, so we're expanding rapidly. And the only reason we're not in the other 50 states, it's because there's legal implications of expanding and you need a different license for all the different states. But yes, if you're in one of those seven states, you can download it and start using it for free. After the free trial, the price is very affordable too. So, it's $7.99 a month and you can use it as many times. No copays, you don't pay per usage, it's just like a flat fee and you get everything including prescriptions. You can go and pick up your prescription in the pharmacy, go to your lab test if you need any blood tests or anything and we do all of that through a network of partnerships. The healthcare team, which I'm sure we'll get into, is a combination of humans and AI. Lukas: So, it maybe triages the questions and the ones that are easier the AI tries to answer and then the harder ones go to a human, or how do you think about that? Xavier: Yeah, it's a great question. That's typically the traditional approach, right? You put the AI up front and then whatever the AI decides it can do it does and then you pass the rest to humans. We go well beyond that. We consider the AI to be just another member of the team and the AI never leaves the room. So, what it will do, is it will call other people. We have a care team that is composed of clinical associates, medical assistants, and then licensed physicians in all the states that we operate and then the AI. Now, the AI will sometimes, as you said, will take over the interaction and just drive it and whenever it's either finished with whatever tasks it was doing, or not sure, it'll call in the physician. But it then stays in the room and it provides assessment and augmentation to the physician, so it's both user-facing and doctor-facing. So, the AI is kind of the connection between the two ends. 
Very importantly, in order to understand this, I think it was kind of implicit on what I was describing, the doctors are part of our Curai care team. So, it's part of the team that is not only providing the care, but also helping us develop the product and helping the system and the algorithms learn from the data that we're generating. This is a so-called learning healthcare system, because we are. At the same time, the AI is helping and augmenting the doctors and the doctors are learning from the AI. But very importantly, the AI is learning from being part of this team and from the data that is being gathered as part of this end-to-end process. Lukas: How is the AI augmenting the doctors? Is it suggesting links to go to for research or autocompleting possible responses? How does it actually work from a doctor's perspective? Xavier: The AI is doing all the above, so yes, it is doing all of that. I mean, as you know, people think of the AI as sort of a magical entity that exists somewhere. And the AI is a combination of different algorithms that are controlled by some protocol, right? So there's different machine learning algorithms doing different things and all of them are augmenting the doctors in different ways. But in a typical, or in a simple, scenario, what will happen is the AI will be part of the so-called history taking and it will start by asking questions to the patients, documenting that as entities in an electronic health record, it will call in the doctor, and then it will say, ""Hey, I have a differential diagnosis, which is a set of possible diagnoses that I think this could be happening. Now, you take it from here, but by the way, I can also suggest questions that you could ask the patient if you want to dig into any of these things."" The doctor at that point can say, ""Oh, wait, this could be COVID. Hold on. Can you suggest a few questions that I could ask the patient to either confirm or invalidate the hypothesis that it's COVID?"" And then the algorithm will suggest questions that either confirm or not that particular hypothesis. As it's going along, it's extracting things from the text because these are all chat-based. It's extracting things from the text, it's highlighting important things. It's also summarizing the conversation for the next doctor that comes in to get a summary, and even going all the way to suggesting treatment if the doctor needs suggestions for a treatment once the diagnosis has been confirmed. Very importantly, the AI or the algorithms never make the final decision to either diagnose or to treat. That's always on a physician and we always say it's very important in this kind of environment to have the physician in the loop and to have the physician make the final decision, but we can augment them and make them much more efficient, but also better quality, right? Because in our offline analysis in the labs, our diagnosis algorithm, for example, has higher accuracy than the average physician. So, we're pretty confident and they keep getting better. We're pretty confident that those diagnosis algorithms are going to be better than most physicians. And even with that, we're not saying, we're just going to make the diagnosis. We're just presenting it to a physician and saying, ""Hey, this could be one of these 3 things or these 10 things. 
How do you want to go from here?"" Lukas: I can totally see, from a communicating-with-the-patient standpoint, including me, that it would be comforting to say, ""Hey, the doctor always makes the final decision."" But this is more of an interview about real-world AI. It does seem like, [for] example, [in] chess, before (I think) Alpha Chess or maybe the latest version of Stockfish, the best chess programs were these hybrid systems with the human in the loop. But then, at some point, the AI got good enough that the human in the loop only messes things up, right? Do you ever have cases where you think that the ML system works better than a human operator and maybe it shouldn't actually give the final decision to a doctor? Xavier: It's a great question. I think, as I said before, generally speaking, it's not that hard. Well, I mean, it's taking us a few years, but it's not ""that hard"" to get an algorithm that's better than the average physician. Now, that being said, it's much harder to get an algorithm that is better than the combination of the human plus the AI. So, even in the examples that you're mentioning, the combination of humans plus AI in chess, if the human is relatively good, meaning a professional player, it's hard to beat, right? So, for an AI alone, a combination of AI plus human is hard to beat. In the case of healthcare, one of the important things to understand is that it's an imperfect information game. It's not about...if you had the perfect information, the algorithm would probably always beat the human, right? And it would be very easy to just beat the human with sort of all the perfect information in the world. However, in the case of medicine and healthcare, there's a lot that goes on with empathizing with the patient, understanding, even things that are called social determinants. Where do they come from? How are they going to understand? How can you communicate the possibility of something being likely or not? And that is very hard to do if you're not a human that is trained to have this level of empathy so to speak, right? So there's the interesting question, and I keep talking to people that have very different opinions on that, right? There is the purely extreme rational opinion that all you want to have from the outcome as a patient is a list of possible diagnoses with a probability, and you'll be able to interpret them. If you're a hyperrational person, that is true. You want to know if you have a 0.2% probability of having cancer, you want to know that there's a 0.2% probability and you think you can deal with it. The reality is that most people don't know how to interpret that, right? What does that even mean, a 0.2% probability of having cancer? Do you want to communicate that or do you want to interpret that and then follow the patient along and make sure that that probability doesn't get to a point that is more likely than not? I think that's where the human judgment is really key and that's very different from a pure probability that is output from any kind of machine learning algorithm. Lukas: Interesting. I guess I would think that I would actually want to have the clear probabilities, but maybe everyone thinks that and they don't really want that. Xavier: No. I think you're probably right. 
If you are in the tech bubble, so to speak, and you're rational and you play music and you're a mathematician or you like math, you think you can very rationally deal with those kinds of probabilities and work with them, but there's a lot of people that are not like that and that's where the empathizing and understanding who you're talking to, it's really key. One of the important aspects, which is somewhat connected to what we're talking, is in particular, our service, we are not designing it for the tech savvy people of Silicon Valley or anywhere. We're really using technology to provide a very accessible and high quality service for people that usually don't even have access to high-quality healthcare and they're under-insured, uninsured, and so on. We need to understand the social background of how are these people going to be interacting with the technology and how they are going to need the human aspect of the technology to help them even understand what's happening and how to react to it. I think that's also very important because — I mean, we could get it...this is more of a philosophical... — but we get usually blamed in tech companies that we design things only thinking of people like us. And then you realize — and particularly in healthcare, it's very interesting because as soon as you start talking to doctors and to anyone from sort of medical profession — you understand, it's like, ""Gosh, yeah."" The way of thinking is different. It's like, even how they think, it's not purely mathematical and you need to have a level of understanding of the different ways that people interpret and process information. Now, that being said, I'm not saying that the traditional paternalistic view of medicine is good. The one which the doctor knew everything, and wouldn't say anything to the patient [but] say, ""Trust me, I know the truth. You have to do what I'm telling you, but I'm not even going to say what your diagnosis is."" No, I am totally against that. I think there needs to be a middle ground and the patients need to have access to their data and we need to be transparent with what's going on and give information as much as possible. And that's part of our model too, for sure. Lukas: Going back to a comment you made earlier that your diagnosis is better than the average doctor, or I guess that your system's better than the average doctor...My first question on that is, how would you even know that? Do you follow up and find out later what the real diagnosis was? Also, how would you train a system to be much better than the average doctor? Do you somehow have a way of finding more accurate doctors and then using that for training or how does that even work? Xavier: This is a great question. So, when I said that, I specifically added, ""in the lab"". We're better than the average physician in the lab and that's because the only real ground truth that we have to evaluate this are the so-called clinical vignettes, which are basically cases that are documented and they're agreed upon, and they've been published. There's not many of those, unfortunately, so that's something that is lacking. But, when we are making diagnoses on those vignettes, we kind of agree that that's a ground truth that's been published and that's the one that we use as the measuring bar. There's a public dataset, which is pretty small, but we also have our own internal one that we keep using for development. And we even use synthetic data and all kinds of different data that we can get into. 
Now, unfortunately, the generation of ground truth in medicine is extremely hard. There are a lot of studies out there with doctors. For example, there's a — well, famous in our field — well-known publication by the Human DX Project² where they found out that the average accuracy of a single doctor on vignettes similar to the ones that I'm describing, so medical cases, was roughly around 60%, between 50% and 60%. And in order to get past a reasonable 80% accuracy, you had to aggregate the opinion of six to eight doctors. So basically, the only way you have to really increase that accuracy is saying, ""Okay, I'm going to ask eight different doctors and then take the opinion of the ones that agree the most and use that as my ground truth,"" which is, honestly, what many of us do in the lab to generate those vignettes. It's not ""trust one single doctor"", but ask many and then have quality processes to understand who is right and then take that as the ground truth. But in order to have a learning healthcare system and sort of have this system improve, the only thing you can do is establish those mechanisms in which the system is actually learning and improving from itself. And you do have sort of humans in the loop having to follow up and saying, ""Okay, we diagnosed this first, was this correct?"" Very importantly, you also have the ability to have follow-ups and very constant follow-ups to understand if you got it right or if you missed something. One of the nice things about the system that we have, which is all virtual and chat-based and message-based, is that we can follow up, and we can automate follow-up, with the patients at very little cost or almost no cost. So, we can literally have the patient come back every hour and check on the patient. It's like, ""Hey, did the fever go up, did it go down? Did we get it right or not?"" Which is usually not the case in a normal medical situation, right? You go see the doctor and then if you're lucky, you see them in two weeks. The sampling time between different data points is much coarser than what we have. Lukas: Yeah. It's funny, I'm thinking about my own interactions with doctors, and I was thinking, when I call a hospital or call my doctor to ask them what to do, I feel like I can almost guarantee that they're going to ask me to come in and get more tests. My little sister is also a doctor and I feel like when I call her, I can almost guarantee that she's going to tell me, ""Lukas, you're fine. You're being ridiculous. Drink some water and get some rest."" And so they're clearly optimizing, my sister and a professional that I call, optimizing for kind of different things. How do you think about that? What do you optimize for in new interactions? I would imagine that missing a serious condition would be so bad that you would really want to err on the side of caution with your suggestions to patients. But how do you know if you're doing a good job there? Xavier: Yeah. I mean, definitely, patient safety is of utmost concern and one that is very critical, and our care team is very much fixated on patient safety first. We do things that even go against what would be good for the business because of patient safety and that's understood, and it's the right thing to do. However, one of the important pieces here around patient safety and around not erring on the side of being extremely conservative is, one, the population that we are dealing with is a population that doesn't generally have good access to healthcare. 
So, if our response to their concerns was always, ""Hey, go and get a blood test and you need to go through this super expensive procedure and good luck with it and come back to us,"" and that would be the kind of service we would be providing, these people would not come back, right, because they literally cannot afford it and it's not something that's optimized in any way. So, we need to provide the best possible care with optimizing also the cost side of the equation for them and for the overall system. And the reason we can do that is because we have this high-level of access and accessibility. So, we can play it safe because we can always tell them, ""Hey, come back in two hours if your fever gets past this,"" or ""If you start coughing tonight, come back."" That's something that most doctors...one of the reasons they err on the safety side, sometimes excessively, is liability, but the other one is because they can't assume that they're going to be in touch with you for the next few weeks, right? So it's like, ""Gosh, I need to just make sure that this doesn't happen in the next two weeks."" If they had the ability to say, ""Hey, you're going to be calling me in every two hours if there's something happening,"" they could take a little bit more of a little less aggressive approach, but that's usually not possible. In a system like this where there's a lot of automation and a lot of accessibility through a virtual...and through an application, through a phone, you can actually do that and it's much more efficient. More importantly, it's more efficient also long-term for the health of the patient, right, because you're catching things right when they happen and you're not letting it get to a point that it's like, ""Oh gosh, now it's too late. Now we need to do this surgery."" Lukas: Can you tell me a little bit about your tech stack behind the scenes? You're actually really deploying, it sounds like, multiple models into production and running them live. Are you continuously updating these models? How do you think about that? Are you retraining them constantly on the feedback that you're getting from the human operators? Xavier: Yeah. There's a combination of different models and each one has its own cycle. We do have what we call the learning loop, which is the ability to inject data back into the models and retrain them. But there's a combination of different models that have different levels of, I would say, velocity in the way that they can be retrained and they can be redeployed. In my experience, that is not any different than any other company. When I was at Netflix, we had the same. We had some that had a lot of data and were retrained daily and there were others that, honestly, they needed to have longer windows of data and more data to be retrained and you didn't need to retrain that that often. So, we're in the same place. Particularly, for example, things like diagnosis models, we don't get that much good quality granular data on diagnoses daily, right? So, it doesn't make sense. And we need to make sure that that data is high quality, we combine it with synthetic data that we generate from a simulator and there's a lot of sort of data cooking going behind the scenes for making sure that those diagnosis models are good. So, that's a model that is not going to be updated that frequently. Now, there are others that are around, say, entity recognition or intent classification or things like that, that we do gather more constant data and those can be updated more often. 
I will say, just to clarify for everyone who's listening, our modeling and even our research is at the intersection of natural language on one side and then medical decision-making on the other. And they both intersect, right? So there's an intersection of both, but we kind of go all the way from using GPT-3 and language models, to using synthetic data from expert systems to train diagnosis models. There's a very cool intersection of both things, ranging from the purely knowledge-based, knowledge-intensive approach of traditional AI systems in medicine all the way to language models and very much deep learning approaches. We have different models that are at the intersection of those. For some of those, as you can tell, the ones that are more on the data-intensive language side, we do get more constant data and we can retrain. The ones that are more knowledge-intensive, we have to sort of do intermediate processes so to speak. Lukas: That makes sense. Do you literally use GPT-3? Xavier: We do. In fact, we just published a paper about it³. We won the best paper award at one of the workshops at ACL. In that particular case, we were using GPT-3 for generating training data for language summarization. So, that's an interesting approach, I think, one that I know several people are following in different domains. Instead of using GPT-3 directly at inference time, to use it as a way to enhance and generate high volumes of training data with different priming mechanisms, it's a very interesting approach and one that we showed in our publication that it's actually better than just having a lot of humans generating training data. So that's an interesting case. Lukas: Can you tell me more about how this works? How do you exactly generate the data and what's the...is it a summarization task? Xavier: Yeah. It is a summarization task, and summarization of medical conversations is pretty hard because you need to generate the data, but also you need to generate data that is...Sorry, you need to have the original data, but then generate summaries, and examples of summaries, most of which are correct but some of which might be incorrect, so you can make decisions when you're training the model. It has to learn what is a good medical summarization and what's a bad medical summarization. So, in the case of this project, what we did is prime GPT-3 with a number of examples of both positive and negative summaries of conversations, and then have it generate thousands of different training examples that we use to train our own offline model. And interestingly, I mean, the availability of more data, but also the more nuanced variability that GPT-3 was generating itself, meant that the final model that we were training was better than anything that we could have trained with our own data and our own human labelers. Lukas: It's so interesting because you would think that the generation task could be so much harder than the decision task, if something's a good summary or not. It's kind of amazing to me that that works so well. Xavier: Yeah. I mean, to be clear, we could have tried to use GPT-3 directly for the task at hand, if we had had access to sort of unlimited resources and fine-tuning. By the way, I know that OpenAI is going to open the API for fine-tuning soon, but we didn't have that at the time. Also very importantly, there's a tricky aspect here with the privacy aspect of the data that we're dealing with, right? 
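(A rough sketch of the data-generation recipe Xavier describes above: prime a large language model with a few good and bad example summaries, then have it emit many new labeled examples for training a smaller summarizer. This is not Curai's code; the seed examples are invented, and call_llm is a placeholder for whatever completion API is used, not a real client.)

```python
import random

# Invented seed examples: (conversation snippet, candidate summary, quality label).
SEED_EXAMPLES = [
    ('Patient: sore throat and mild fever for two days...',
     'Two days of sore throat and low-grade fever; no red-flag symptoms.', 'good'),
    ('Patient: sore throat and mild fever for two days...',
     'Patient reports chest pain and shortness of breath.', 'bad'),  # wrong content
]

def build_prompt(seed_examples, n_new=5):
    # Few-shot prompt: show labeled summaries, then ask for more in the same format.
    lines = ['Each example is CONVERSATION / SUMMARY / LABEL (good or bad).', '']
    for convo, summary, label in seed_examples:
        lines += [f'CONVERSATION: {convo}', f'SUMMARY: {summary}', f'LABEL: {label}', '']
    lines.append(f'Write {n_new} new examples in the same format.')
    return '\n'.join(lines)

def call_llm(prompt):
    # Placeholder for a real GPT-3-style completion call; returns canned text here.
    return 'CONVERSATION: ...\nSUMMARY: ...\nLABEL: good'

def generate_synthetic_examples(rounds=3):
    examples = []
    for _ in range(rounds):
        random.shuffle(SEED_EXAMPLES)          # vary the priming a little each round
        raw = call_llm(build_prompt(SEED_EXAMPLES))
        examples += [block for block in raw.split('\n\n') if 'SUMMARY:' in block]
    return examples  # parsed and filtered further, then used to train a small summarizer

print(len(generate_synthetic_examples()))
```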
We don't want to be in a situation where we are sending GPT-3 data that is private from our patient, unless there are some guarantees of very strict compliance and privacy. So, if all those things were met, you could use GPT-3 directly and you would probably get a summarization that is as good as the one that we were generating. However, because that did not exist, it's a very interesting intermediate step to, again, prime GPT-3 with some knowledge and some examples, and then let it generate all these other training examples that you can then use to train your own. I mean, you're not going to train a GPT-3, but you don't need to, right, because the complexity of the model and the number of parameters that GPT-3 has is because it's a language model and it's a universal language model, right? But the model that you're training, which is very much focused on summarization in a particular domain, you can train a much smaller model, much more efficient with the right data and you're going to get the same... I mean, I'd be interested, I don't know if it's exactly the same accuracy or it's even better because, again, there's the question of how much a universal language model can be as good as a smaller model on a very specific task, right? Which is what we train. Lukas: That makes total sense. That's really cool. Do you worry about training on the conversations that you have? I imagine those are incredibly sensitive conversations with patients. If you use that data to train models, is it possible that some of the information could kind of bleed through into the models? Do you take precautions somehow to try to remove personal identifying information before you train a model on the data? Xavier: Yes, we do. All our models...sorry, all our data sets that are used for our models go through a de-identification process and we do make sure that the identifiers that our original data sets have are actually extracted. That being said, you can never guarantee 100% perfection, right, on those. De-identification of texts is in itself a research task, so there's different approaches to it and there's different things that can be done. But even with that, you'll get as far as a particular percentage of accuracy. We're pretty confident that most of our data sets that we use to train the models are pretty well de-identified, which then in turn means that the likelihood that then something even bleeds into the model is very, very small, right? Because it would need to be the combination of ""something makes it through the identification step"" plus ""something gets picked up in the model that then can be retrieved"". But that, otherwise, sure, it would be a concern. Right? Lukas: Do you have systems to evaluate the quality of the models before you deploy a new one into production or do you do live production monitoring on the quality of models as they run? Xavier: We do have systems and we do have a process in place. We have different data sets, different metrics and different sort of processes to make sure that we detect any anomaly It's interesting because, in fact, I was talking today to François⁴ who is leading my AI engineering team. They're building a tool now that we're going to be using that basically automatically enables you to analyze the anomalies that we detect when we change a model, but by seeing actual examples of what is the actual case. I was talking about the vignettes that we have, for example for diagnosis, right? 
So if you train a new model and all of a sudden you see a different like, ""Hey, this metric is lower than in the previous version of the model."" That's okay. But in this case you really want to understand, is it being unfair to a particular demographic? Is it worse for older people or for women or for...or can I actually go and see where it made the error? And then, interestingly, now you need the collaboration of a physician or a doctor to sit with you and say, ""Hey, this new version of the model decided that this thing, instead of the flu, was a cold. Is this correct or what's going on?"" And then you need to debug. In most cases — and I know this is something that in other companies, people have this kind of debugging tools — but they are usually debugging tools that a layman or a layperson can understand. When I was at Netflix, we did have a similar tool that you would see the shows and like, ""Whoa, this ranking doesn't make sense."" But if you're dealing with a highly knowledge-intensive domain like medicine, you actually need that collaboration with the doctors. And we do have doctors in the development team and we do have experts that are sitting hand-in-hand with the engineers and the researchers to do those kinds of iterations and debugging and QA of the medical models. Lukas: That's cool. So what does the interface look like? It sort of shows somehow the explanation behind why the model made the choice that it made? Xavier: Yes, yes. It shows the overall difference between the previous model and the current model, and then you can click and see sort of, like, ""Okay, what are the ones that it got right and what are the ones it got wrong?"", compare the two models and you can kind of see the diff with a color code, so you can actually dig and say, ""Okay, well, yeah, this one they got wrong. It's very wrong, so we should not move forward."" Lukas: Do you try to build models — I mean, GPT would be kind of the furthest from explainability you could go to, but do some of your models — do you try to build them in ways that they maintain explainability? Xavier: Yeah, it's a great question. I think explainability, it's important, but it's also kind of tricky in the case of medicine, in the sense that not even doctors many times have an explanation for their decisions. In fact, something that is kind of a little nuanced, but I think it might be interesting is, many times, doctors will go all the way to prescribing without having a clear diagnosis. That's called symptomatic treatment, right? So it's like, ""Oh gosh, I don't know if this is flu or a cold, but no matter what, I'm going to prescribe you this particular thing because it's going to be good for both things."" And they don't really have a clear diagnosis. That's not bad. I mean, it's okay. It's better to do that than to do nothing. In fact, the good thing is to be doing some symptomatic treatment and then following up and understanding like, ""What's the evolution? Did I get it right or not?"" So, as long as you have a possibility to follow-up. So, explanation is not always possible and it's not always available in an imperfect information situation, right? Now, that being said, if you do have it, it's good to provide it and it's something that we have definitely worked on, on providing explanations. I'm actually a fan of adding explainability as a post hoc process to the model. 
I think it's something that has a lot of value and does not necessarily require the model itself to be explainable, but you need to go after the fact and understand like, ""Okay, this is why the model picked this and is there an explanation that can explain in a simple way, why did the model pick this particular option or this particular cluster?"" Lukas: So, how do you do that? If you have a really complicated model, too complicated to inspect, what kinds of methods do you like to use to get at some explainability of why the model did what it did? Xavier: Yeah. I mean, there's different approaches to adding explainability, right? I mean, the simplest one is you approximate the decision boundaries of your model, no matter how complex, no matter whether it's a deep model or not, by a simpler linear model and then use that to build the explanation, right? That is a typical approach that many of the explainability solutions take and that's one that can actually work pretty well. And it's one that we have experimented with and even implemented. I will say that's not really implemented in the product yet, but it's been implemented sort of as a prototype. And I think we even wrote about it in one of our blog posts. So that's, I think one of the easiest, but also at the same time, more effective ways to explain things that have a complex non-linear decision boundary and cannot be explained in easy terms. I will, again, say that in many cases in medicine, those decisions do exist, right? And even though as much as we try to infer causality from the decisions, those are hard to come by because there's a lot of nuances in the ways that the information is being processed and the decision boundaries of the models are being constructed. Lukas: Interesting. Well, we're getting close to running out of time and we always have two question that we end on, and I want to make sure that I cover them with you. The second-to-last question that we always ask is, what's an underrated aspect of machine learning? And I guess I would say across your career at Netflix and Quora and Curai, what's been a topic in ML that's been maybe particularly useful to you or important that you feel like research doesn't cover as much as it should? Xavier: I think an important topic that is not covered in research enough, despite the fact that I've tried to put in myself because I was a researcher back in the days before going into industry, is what actually happens to models in the wild, right? It's like, it's a different thing that you build a model with perfect data that has been cooked in the lab and you know what it is and you can have control over the boundaries and even understand the distribution of the noise and all the different variables, than to deploy a product in the wild and to then be faced with all kinds of different drifts and the data distribution noise and whatnot. I think that is something that is not usually researched enough, understandably so, because you can't. I mean, most research is done with data sets that are available and are distributed in a way that they're kind of artificial, right? I mean, I went into Netflix through the Netflix Prize⁵, so I know that that was a dataset that was very, very good and very exciting to make progress in recommendations and the recommender systems arena. However, it was very different from the data that I found out we had at Netflix when I was there, and there were kind of all these other things happening, right? Lukas: Right. Xavier: So, yeah. 
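(A rough sketch of the post hoc approach Xavier describes: approximate a complex model's behavior near one case with a simple weighted linear model and read an explanation off its coefficients, the idea behind tools like LIME. The data and feature names are synthetic placeholders, not anything from Curai.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))                       # six made-up 'symptom' features
y = (X[:, 0] + 0.5 * X[:, 2] ** 2 > 0.5).astype(int)
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

x0 = X[0]                                            # the single case we want to explain
neighborhood = x0 + rng.normal(scale=0.3, size=(500, 6))        # perturb around it
p = black_box.predict_proba(neighborhood)[:, 1]                 # black-box outputs
weights = np.exp(-np.sum((neighborhood - x0) ** 2, axis=1))     # closer points count more

# Fit a simple local surrogate whose coefficients serve as the explanation.
surrogate = Ridge(alpha=1.0).fit(neighborhood, p, sample_weight=weights)
for name, coef in zip(['fever', 'cough', 'age', 'bp', 'hr', 'spo2'], surrogate.coef_):
    print(f'{name:>5}: {coef:+.3f}')
```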
Lukas: And I guess it's also kind of a hard problem to formalize, right? There's so many variations on it. I mean, there are a lot of people talking about it, but it's hard to...I guess it's like ML robustness or something, right? Xavier: Yeah. I think, yeah, robustness is one. I think another one that is very interesting, that we have done some work on, is out-of-band classification or prediction. It's like building models that can react to classes that have not been seen during training. For example, that's a very important aspect and one that gets a little bit of attention in research, but not so much. I will say, for example, that particular problem is one that's relatively easy to replicate in a lab, right? So basically, you can build models that say, ""Hey, I have 100 classes, but I'm only going to let the model see 50 during training, see what it does with the other 50."" And the model needs to understand, ""Hey, I haven't seen this class. Sorry, I don't know what to say."" So that's an example of something that...out-of-band classification is one that kind of mimics some of the problems you see in real life. Because in real life, you will deploy a model and it will see something that is very different from the things it's been trained on. Having the model be able to raise its hand and say, ""Hey, I don't know what this is because I've never seen it before,"" that's a very interesting, for example...it's a specific concrete case, but it's one that relates very much to having these models in real life and being able to replicate these kinds of situations in a lab. For example, another one that we've worked on is on introducing artificial noise into the training and testing by using some domain knowledge, right? For example, in medicine, you know the prevalence of some symptoms and you can say, ""I know that if I ask people if they have a headache, many people are going to say yes, because most people have a headache"", right? It might not even be related to the current situation and the current condition, but people are going to say yes. Well, you can play around with those knobs and introduce artificial noise in your training datasets to then anticipate some of the noise you're going to be finding in the wild and in real life. So, that's another example. It is hard to recreate the exact situation you're going to find out there, but I think there are some interesting ways to mimic at least some of those situations that probably deserve more attention than they usually get. Lukas: It's fine if not, but do you have a favorite paper on the topic that you'd like us to point to or any research you could send people to who want to learn more? Xavier: Well, I have our papers that I could point you to. I mean, the two things that I mentioned we did...dermatology image classification with out-of-band distribution⁶ — that one refers to the first thing that I was talking about — and the artificial noise that we introduced in synthetic data, that's a paper we wrote on diagnosis and diagnosis training⁷. And of course, those papers cite a lot of other papers that could be interesting, so I could definitely point you to those. Lukas: Cool. That's perfect, having a good starting place. Xavier: Yeah. By the way, if people are interested in our tech blog at Curai, we have a full list of our publications⁸. We probably have now in the order of 15 or 20 publications. I like to be very open about the research we do and I think that comes from my old times as a researcher. 
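(A minimal sketch of the hold-out-some-classes experiment Xavier mentions, using a maximum-softmax-probability abstention rule, which is a common baseline rather than necessarily what Curai does. All data here is synthetic.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=4000, n_features=20, n_informative=15,
                           n_classes=10, n_clusters_per_class=1, random_state=0)
seen = y < 5                            # train only on classes 0-4; classes 5-9 stay unseen
clf = LogisticRegression(max_iter=1000).fit(X[seen], y[seen])

confidence = clf.predict_proba(X).max(axis=1)   # highest predicted class probability
THRESHOLD = 0.7                                 # would be tuned on held-out seen-class data
abstain = confidence < THRESHOLD                # the model raises its hand: unknown class

print('abstention rate on unseen classes:', abstain[~seen].mean().round(3))
print('abstention rate on seen classes:  ', abstain[seen].mean().round(3))
```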
I'm very much in favor of open publications and sharing knowledge and you will find most of that in our blog posts. Lukas: Cool. Awesome. We'll definitely put a link to that. My final question is kind of broad, but I'm really curious about your case. Basically, what's been the hardest part, or maybe the most unexpectedly challenging part, of getting ML models deployed in production? Just kind of going from conception of ""This is the thing we want to do"" to ""Here's a working model"". Where are the big bottlenecks? Xavier: Well, yeah, that is a broad question and I think there's a number of things that come to mind. I think at the highest level, one of the very difficult things to get right is how do you connect the offline metrics that you have in anything you're doing in any model in the lab to what's going to be the real impact that that model has on your product. We would love for that to be a clean thing and say, ""Hey, if I get my precision and recall up, then my F1 measure increases. I know that that's going to work in production and that's going to generate this much lift and whatever."" People are either going to click more, or be happier, or love the product more. That road, from what you see in your model in the lab to the model in production, is usually not that straight. And there's a lot of issues that get in the way and a lot of questions and a lot of things that are really, really important to get right. Some of them relate to mundane things like the UX, right? How is the user experience? How are you presenting things to, in our case, the patient or the doctor, and how are they reacting? Your model might be awesome, but it's like, if they're not seeing it or it's confusing, or you're not explaining it right, that's not going to help in any way, or it might be worse. It might be confusing. So, I think the connection between research and modeling and the actual user experience and the interface and how that's actually introduced into the product is an aspect that I find fascinating myself and it's very hard to have people that actually understand the end-to-end. It's like, because you need to have a very broad experience that goes all the way from the modeling to the metrics, to the product, to the user. Understanding the user research, that's really hard to cover end-to-end. And then you need to build all that through collaboration and through teams that have the ability to collaborate. In medicine, that's even harder because you throw in the domain knowledge, and it becomes even more tricky, right? It's like something that you might see in the lab and it's like, ""Whoa, this is fantastic. This is getting the metric. This is actually going to be a killer feature."" It might turn out that it's a killer feature in the wrong way. Sorry, I shouldn't have used that metaphor probably in this context, but it's important to understand that the results that you get in your experiments are mediated by many things before they can be evaluated in an A/B test, for example. Lukas: Awesome. Well, thanks so much for your time. I really appreciate this conversation and it's just super interesting. Xavier: Thank you. Yeah. Lukas: If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out. 
If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team. We're looking forward to meeting you.",7688 +Spence Green — Enterprise-scale Machine Translation,https://www.youtube.com/watch?v=A9bTVXaKI1Q,2626,2021-07-15,"Spence: Translation is in this space of so-called AI-complete problems. Solving it would be equivalent to the advent of strong AI, if you will, because for any particular translation problem, world knowledge is required to solve the problem. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Spence Green is a machine translation researcher and also the CEO of a startup called Lilt, which is a leading language translation services company. He has been using TensorFlow since the very beginning and has been putting deep learning models into production for longer than almost any of us. I'm super excited to talk to him today. I think the best place to start here is, you're the CEO of Lilt and you built Lilt, maybe you can just give us a description of what Lilt is and what it does. Spence: Well, I think it's important to say where the company came from and the problem that it solves, and then I can explain what it does. I think what it does follows from that. Lukas: Perfect, that's great. Spence: Where it started, at least for me personally...in my mid 20s, I decided I want to learn a language. And so I moved to the Middle East for about two and a half years. And while I was there, two important things happened. The first was...So I was learning Arabic and I had a friend, and I was talking to him one night and he said...he was like the building watchman in my building. and I was talking to him and I was like, ""What did you do in Egypt?"", where he was from. And he said, ""I was an accountant."" I said, ""Oh really, why aren't you an accountant here?"" And he said, ""Because I don't speak English."" I was like, ""Okay, well, we're in an Arabic speaking country and you can't get a job as an accountant"", and it's because people make a certain amount of money if they speak English. If they don't, they make less, and I had never really encountered that before. Six months or so after that conversation, Google Translate came out, and I got really excited about that. I left my job and went to grad school, started working on MT. And then a couple of years later, I was at Google working on Translate where I met John, my now co-founder, and Franz Och, who started the group at Google and did all the really early pioneering work in statistical MT. We were originally talking about books a lot and why books don't get translated, and we found that Google's localization team that did all of their language-related work for the products didn't use Google Translate. This was amazing to me, why would this be? And the reason is because, in any business setting or non-consumer setting, you need a quality guarantee. An MT system, like any machine learning system, it can give you a prediction, but it can't really give you a grounded certificate of correctness about whether it's right, and that's what businesses want, or book publishers, or whatever. 
So we started building these human-in-the-loop systems where you need the human for the certificate of correctness, but the crux of the problem is to make that intervention as efficient as you can. Lukas: I mean, I guess my biggest question that I was thinking about, that I've always wanted to ask you is, how different is the problem of translating something properly versus setting up a human-in-the-loop system with a human translator to translate well? Is it almost the same problem or is it quite different? Spence: By translating it properly, what do you mean? Lukas: I guess, I mean, so Google Translate is just trying to give you the best possible translation. Spence: Got you. Lukas: I assume that what you're doing is like helping a translator be successful translating something, presumably by guessing likely translations. Spence: Yes. Right. It's a good question. So the question is the mode of interaction with the machine. The way that machine translation systems have been used, really, since the early '50s, was when this line of research started. It's funny that machine translation was like this really old machine learning task and originally people thought the digital computers that were developed during the Second World War for bomb making and for cryptography, the initial idea was, ""Russian is just English encrypted in Cyrillic, and so we can just decrypt Russian."" The initial systems that were built in the '50s weren't very good. The naive idea was ""Let's just take the machine output and pay somebody to fix it"". And this linear editing workflow is what our work in grad school was about, was going beyond that in some way, like a richer mode of interaction. What we came up with was effectively a predictive typing interface. There are two problems that we really wanted to solve. One was, when you're doing translation, the system makes the same mistake over and over again, documents tend to be pretty repetitive. It's an annoying user experience and it's inefficient when the system just makes the wrong prediction over and over again. So the solution to that is to have a system that does online learning, which was part of the work. The other was, ""How can you interact with a text string beyond just using your cursor and fixing parts of it?"" And that is doing predictive typing. So if you put those two together, you want to do online learning and you want to do predictive typing, it's a fundamentally different system architecture than the type of system you build for like, Google Translate system architecture. Lukas: Although it seems fairly close. I mean, the predictive typing, I would think you have a language model and a translation model. Is it sort of the same... or at least that's how MT systems used to work, or at least in my memory, right? Is it? Spence: That's the way that the statistical systems used to work and really it came down to doing inference really rapidly. Well, yes, it came down to doing inference really rapidly and doing inference with a prefix. Instead of just decoding a sentence with a null prefix, you send a part of what the translator did. The old systems...we actually had a paper on this a couple years ago¹, how to do inference with a prefix was an algorithmic problem that you had to solve. The new neural systems just do greedy beam search, so it's actually pretty straightforward to do that these days. Lukas: And is that what you're using? Spence: Yeah. I mean, like everything in NLP these days, it's a Transformer architecture, and a pretty vanilla one too. 
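(A toy illustration of the interaction Spence describes: decoding is forced to start from whatever the translator has already typed, and a greedy search then completes the sentence. The model here is just a hand-written table of next-word scores standing in for a real neural decoder.)

```python
TOY_MODEL = {
    # previous word -> {candidate next word: score}; a stand-in for a real MT decoder
    '<s>':   {'le': 0.6, 'un': 0.4},
    'le':    {'chat': 0.7, 'chien': 0.3},
    'un':    {'chat': 0.5, 'chien': 0.5},
    'chat':  {'dort': 0.8, '</s>': 0.2},
    'chien': {'dort': 0.6, '</s>': 0.4},
    'dort':  {'</s>': 1.0},
}

def greedy_complete(prefix_words, max_steps=10):
    # Continue the translator's typed prefix, always taking the highest-scoring next word.
    out = ['<s>'] + list(prefix_words)
    for _ in range(max_steps):
        candidates = TOY_MODEL.get(out[-1], {'</s>': 1.0})
        next_word = max(candidates, key=candidates.get)
        if next_word == '</s>':
            break
        out.append(next_word)
    return ' '.join(out[1:])

print(greedy_complete(['le']))            # translator typed 'le'        -> 'le chat dort'
print(greedy_complete(['un', 'chien']))   # translator typed two words   -> 'un chien dort'
```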
What our team really focuses on is domain adaptation, rapid and efficient domain adaptation. So we do personalized models, either at the user level or at the workflow level, for all of our customers. Lukas: All right. And workflow means like a set of documents, so you're learning a specialized model as the translation happens? Spence: I think the way to think about it is more from your early days, which is, anywhere that you have an annotation standard, you would have a personalized model. So if you think about it in a business, a marketing workflow has a writing standard that may be different than a legal workflow. And so you would have different models for each one of those workflows. Lukas: I see. So you're actually training, then, thousands of models or more. Spence: Yes, that's correct. So there are bunches of different models being trained continuously in production all the time, right now. The way that you can think about what the translator does — and I think this is what's really interesting about this task — is that in most machine learning settings, data annotation for supervised learning is an operating cost. You have to pay people to go off and do it; it's an artificial task. With translation, you can think of the translators as just doing data labeling. They're reading English sentences and typing French sentences, and as soon as they finish, you just train on it. Lukas: Right. Right. And do the models get noticeably better over time? Spence: Yes. Lukas: That's super cool. I'm curious about the technical details of just making this work, but before getting into that, I'm curious, you started in 2014, is that right? Spence: Early 2015 we started the company- Lukas: 2015. Spence: Yeah. Lukas: So you've seen such an arc in terms of... I mean, I feel like machine translation has had such big changes, at least from my perspective. Has that been hard to adapt to? Has that helped you? Have you had to learn new skills to take advantage of it? Spence: Yes. We started the company in late 2014, and the system that we had, which we built at Stanford over the course of about 10 years, was competitive with Google Translate. In December of 2014, the first neural MT paper was published². I mean, people worked on neural MT in the '90s, but it didn't work. And so they got it to work again. There were two papers published, one in December 2014, the other one in January 2015³. And it's like, pretty promising, but nowhere near production ready. And then I think the thing that was really quite shocking was how quickly Google got that into a production scale system, which happened in the late summer of 2016⁴. At that point, our system was as competitive as anyone. And then suddenly, there was this huge leap in translation quality. We were graduating, all three of us — John and I and a third guy — right at this crossover point. So we didn't really have any empirical experience with these neural machine translation systems. So we had to build a neural MT system from scratch over the course of about six months. We went from the Stanford system, which was about 120,000 lines of code that had been developed over a decade, to a system that I think was about 5,700 to 6,000 lines of code and- Lukas: That's amazing. Spence: I mean, it's really quite shocking. A bunch of that is pushing a lot of the functionality down into the framework, whereas everything in the Stanford system was custom-built. Lukas: I guess, 2016, what framework are you using? 
Is it Caffe or is it even before that? Spence: No, we wrote it in TensorFlow from the beginning. So- Lukas: Wow, wow. Cool. Spence: It was, I guess, an okay technology bet. I think there's some push to move to PyTorch, but we've got a pretty significant investment in TensorFlow at this point. Lukas: Yeah, I would think so. Were you sure that it was going to work? I mean, this seems like a really painful experience for a startup to do mid-flight. Spence: It was terrible. Yeah. I mean, you kinda just had to do it. The results were so compelling. I think that MT really is, probably of all the tasks within NLP that deep learning has really revolutionized, I think it really makes the case that MT is probably the most significant example. The recent language modeling work, of course, is really impressive, but MT just went from being kind of funny to being meaningfully good. Lukas: How did you find enough parallel corpora to make this work? Spence: Well, there's quite a bit of public domain data. So for example, the UN has to publish all of its proceedings in its member languages. There are news organizations, like the AP, that publish in different languages. There are open source projects, that GNOME project⁵, for example, that publishes all their strings in a bunch of different languages. So you can train on all that, and then you've got web crawl too, which is where most of the training data comes from. Lukas: I see, I see. It's funny, I remember working on MT briefly at Stanford and feeling like it was really unfair that Google had so much more access to data... Spence: It does help to have a search engine. Lukas: I mean, I guess if you're mostly doing web crawl, then that makes us ... I remember, just all kinds of weird artifacts from... I think we were training on the EU data that was in all those languages, and it was just such bias towards political meanings of nouns. It just seemed ludicrous sometimes. Spence: I think in an enterprise setting, that's the real value of domain adaptation. The second thing that I think is interesting is the legacy approach to enterprise, translation within the enterprise, is to just build a database of all your past translation. If you translated something before, you just look it up in the corpus and retrieve it, otherwise, you send it off to a vendor. Big companies that have been doing translation for decades have this big corpus that they've built up. We train on that too, and that customer-specific training is where you get the real improvement versus just a big general domain system. Lukas: I guess at the end of the day, how much... I mean, do you measure your results in how fast you can get a translation done? Is that your core metric? And I guess if so, how does that change with the quality of the translation? Do you get diminishing returns, or as it gets close to perfect, can someone just like cruise through a translation? Spence: Well, I think that there are... maybe I should say a few sentences about how a customer would work with us. Lukas: Sure, sure. Spence: An example of one of our customers is Intel. And if you go to intel.com, in the top right corner, there's a dropdown and you can change the site into 16 different languages, and that's all of our work. If you start looking that way, you'll see translation all around you. You'll see it on websites, you'll see it in mobile apps, you'll see it when you get on the airplane and get 10 language options for the in-flight entertainment system. That's where this can be used. 
Right now, it's a problem that you can solve with people. You can hire people to solve it. The problem is the amount of information that's being produced far exceeds the number of people that are being produced in the world right now. And so you can't just solve it just with throwing bodies at it. That's why you need some automation. So an example like that Intel website...From their side, what they just see is us delivering words. The only real metrics that matter are how quickly that gets done and the quality level that it gets done at. They don't really care whether it's machines, or lemmings, or whatever is doing the translation work. On our side, it's...the whole name of the game is using automation to reduce the production cost and the production cost per word. When you produce a word to give to an enterprise, there's a translation cost and a QA cost and workflow routing cost, and there's a software hosting cost, there's a bunch of different cost buckets, and it's just minimizing that. Lukas: Am I wrong, that the majority of the cost would be the human that's doing the translation? Spence: That's exactly right. So then the metrics that we care about internally have to do with making that part more efficient, but that's not something that...it translates into business value, and then it reduces the cost of what we provide to customers, and it makes it faster, but those metrics are not the same metrics that our customers care about. Lukas: Are there cases where you worry about with a self-driving car where someone... it's so good that they stop watching and the car crashes? Does your translation ever get so good that you worry that an annotator might just start accepting every prediction and quality might suffer? Spence: Yes, this is a good question. I think it's more of a risk, and this bears out empirically in the linear post editing workflow that I mentioned, where I just give you some machine output for some random machine and ask you to correct it. It's a passive task, and cognitively it's not very engaging. People tend to just gloss through that and make mistakes. Whereas in predictive typing, it's like an active engaged task. And so if they're basically cheating there, then it comes down to performance management on our part of, ""Whoa, this person did 2,000 words in 10 seconds. That doesn't seem right."" So you can monitor that. Lukas: How do your customers think about the quality? Is it like an intuitive feel for it or are they like spot checking it? Or how does that work? Spence: I think it's again, in the same realm of an annotation standard like your world, where we work with the customer to define what we call a text specification, which is, ""What are the text requirements within each language?"" That usually follows from marketing guidelines. They have their brand and style and copy editing guidelines. And then how does that manifest in Chinese and Japanese and German and French? We have a QA process where we have raters go in and rate the sentences according to that framework. And then that's what we deliver back to them. Lukas: So you don't just deliver the result, you deliver an estimate of the quality based on raters. Spence: Yes, yes. Lukas: I see. That's cool. They must appreciate that. Or is that industry standard to do that? Spence: No. There are some vendors that will implement like a scorecard and they'll give you the scorecard back with the deliverable, but we just try to keep it... 
we just count the number of sentences where there's some annotation error, and then we fix those, but it gives you some sense for what the overall error rate is. Lukas: Got it. I think people have pointed out that in translation, there can be ethical issues. I think people noticed that Google was...in languages where the pronouns aren't gender specific, making it ""he"" for traditionally male occupations. Is that something that you think about or incorporate into your models at all? Spence: Well, I mean...part of my work in grad school was on Arabic. When you work with Arabic corpora, it's almost all male pronouns, because it's coming from newswire, and most of the people who are active politically in the Arab world are male. So that's the representation in the data. And so systems will tend to predict masculine pronouns for lots of different things. But in the human-in-the-loop model, you have people who are there correcting that, and they can use the suggestion or not. Through that annotation, you'll get a different statistical trend that the system will start to learn. Lukas: I see. Spence: So it's self-correcting. Lukas: Cool. I guess I really am interested to know about the technical details of your system, as much as you can share. I mean, you were a super early user of TensorFlow, and you have all these models running in production. Can you, at a high level, tell me how the system works and how it's evolved? Do you use TensorFlow Serving to serve these up? How do you even run all these models in production at once? Spence: Yeah. I think the most interesting part of it is...there are several interesting cloud problems to solve, but the big ones are these. You have a budget, if you're implementing predictive typing, of about 200 milliseconds before the suggestions feel sluggish. That means that the speed of light starts to become a problem. You have to have a multi-region setup because our community of people who are working are all over the world. You usually hire translators within their linguistic community who are fluent in that native language, so we have people all over the world. So the first thing is it has to be a multi-region system. The second is, it's doing online learning, so you have to coordinate model updates across regions. And the third thing that I think is interesting is making inference fast. Commonly in a big, large-scale system like Google Translate, you'll batch a bunch of requests, put them on custom hardware, run it, and then return it. But if you're switching in personalized models to the decoder, basically on every request, then you have to run on the CPU, and you have to have a multi-level cache to be pulling these models up off of cold storage and loading them onto the machine. So a lot of the engineering is to make it fast worldwide, and to make the learning synchronized worldwide. Lukas: You mentioned that there's some notion of switching to PyTorch. What would push that at all? Spence: This is where my expertise runs into its empirical limitations. The two things that I've heard from our team are that you can prototype faster in PyTorch than in TensorFlow, and that there have been some backwards compatibility issues from TensorFlow 1 to TensorFlow 2. There tend to be more breaking changes. We've got our system running in some TensorFlow 2 compatibility mode with some frozen graphs from before. That's been a little bit of a problem. 
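As a rough illustration of the per-request model swapping Spence describes, here is a minimal LRU-style cache sketch. The storage loader, capacity, and model IDs are invented for the example; a production system would layer this over local disk and a remote object store, but the core bookkeeping looks roughly like this.

```python
# Illustrative sketch (an assumption about the general pattern, not Lilt's code): an
# in-process LRU cache that keeps recently used personalized models in memory and falls
# back to a slower "cold storage" loader on a miss, for per-request model switching on CPU.

from collections import OrderedDict

class ModelCache:
    def __init__(self, load_from_storage, capacity=32):
        self._load = load_from_storage      # hypothetical loader: model_id -> model object
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, model_id):
        if model_id in self._cache:
            self._cache.move_to_end(model_id)      # mark as most recently used
            return self._cache[model_id]
        model = self._load(model_id)               # slow path: cold storage / object store
        self._cache[model_id] = model
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)        # evict the least recently used model
        return model

# Toy usage: the "loader" here just fabricates a placeholder object.
cache = ModelCache(load_from_storage=lambda mid: {"id": mid, "weights": "..."}, capacity=2)
for request_model in ["acme-legal", "acme-marketing", "acme-legal", "globex-docs"]:
    print(cache.get(request_model)["id"])
```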
Lukas: I think one notable thing from our perspective has been this rapid ascendance of Hugging Face. Has that been relevant to you at all? Do you use it anywhere? Spence: We don't. I think when...It's funny. When the Transformer paper came out — Ashish Vaswani was a contemporary of mine in grad school, and Jakob Uszkoreit has been a great friend of our company — we called Jakob the next day, and we're like, ""Let's talk about this."" And so we talked it through, and we started working on it. It was a really tricky model to get working correctly. And it took some time. I think that paper came out on a Tuesday, if memory serves, and I think Joern⁶ started working on the implementation on Wednesday morning. Lukas: Wow. Spence: Something like that. It was like December or January before we had a working model. And I think their Tensor2Tensor release helped a lot; there's some black magic in there that helped. So this was like mid-2017. But it's tricky to get working right in production. So I think having a library that people can use more broadly, people who may not have the same internal resources to get these systems working, is really, really valuable. Lukas: Totally, totally. Do your latency and throughput requirements mean that your models are different at all from what a Google Translate might use? Spence: Yes. If you're running on custom hardware, you can of course afford to run higher dimensional and more expressive models. We have to do quite a bit of work with knowledge distillation to try to compress the models, so that inference is fast on the CPU. It's also been really helpful that Intel is one of our investors, and their technical teams have helped us with some optimizations to make it run faster on a CPU. That's been really valuable. Lukas: That's cool. Do you use different models at all for different language pairs? Spence: The short answer is yes. There's a general domain model for every language pair that the domain adaptation starts from, and it basically just forks off of that. And then the model fork starts learning. We change the general domain models much less frequently. Actually, just yesterday we released new models for English to Japanese and Japanese to English, and one of the researchers has been working on much deeper encoders. I think the one that came out yesterday has a 12-layer encoder, whereas historically we've been running a 4-layer encoder or something like that. So over the next little bit, we'll be moving more of our general domain models to the current state-of-the-art architectures. Lukas: And your general domain models, though, those are different for each language pair, right, or is there sort of one? Spence: Yes. That's an important point. I think one of the most exciting papers in the last couple of years was on training multi-source, multi-target models⁷. Google had a paper last year or the year before where they just piled all the corpora together and trained this huge neural network. This is really hard to think about coming from the statistical MT days, because it's just crazy to do in a statistical MT system, but we use some groups of languages. We'll group similar languages, especially if they're low-resource languages and we don't have much data, and then you'll have a system that's for five different languages or so. Lukas: There's something about that that's so appealing. 
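For readers who want to see what the knowledge distillation step Spence mentions might look like mechanically, here is a small, generic sketch of a soft-label distillation loss. It is a textbook formulation, not Lilt's recipe; the temperature value and the toy logits are arbitrary.

```python
# Illustrative sketch (a standard formulation, not Lilt's code): soft-label knowledge
# distillation, where a small CPU-friendly student is trained to match the softened
# output distribution of a large teacher model.

import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature) + 1e-9)
    return float(-(teacher_probs * student_log_probs).sum(axis=-1).mean())

# Toy usage: a batch of 2 positions over a 5-token vocabulary.
teacher = np.array([[2.0, 0.5, 0.1, -1.0, 0.0], [0.1, 3.0, 0.2, 0.0, -0.5]])
student = np.array([[1.5, 0.2, 0.0, -0.8, 0.1], [0.0, 2.5, 0.1, 0.1, -0.2]])
print(distillation_loss(student, teacher))
```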
I mean, I'm way out of date, so I never saw that working when I was in grad school, but I love the idea of it. Spence: Yeah. It's a really attractive idea. Lukas: It sounds like it's actually working. Spence: It does work, yeah. Lukas: So I guess, I don't know how much you feel comfortable expounding on this topic, but I'm really, really curious. I mean, do you have a feeling on how far MT goes? Do you think that human-level MT is realistic? It's funny when you talk about companies wanting quality guarantees. I mean, I would think just having used a lot of Google Translate in my life, quality guarantees seem like it would be useful, but also just seems like the quality of Google Translate just isn't good enough that I would want to put that on my website, generally. Do you expect that that is likely to change? Spence: Yeah, I guess I can offer some assorted comments on thinking about that. Lukas: Please. Thank you. Spence: In no particular order, because I think there are both technical and social issues to do with that. And I think there's some philosophical issues. So let's start with the philosophical issue. Translation is in this space of so-called AI-complete problems, so solving it would be equivalent to the advent of strong AI, if you will, because for any particular translation problem, world knowledge is required to solve the problem. There are inputs that are not in the string that are required to produce a translation in another language. Lukas: Sorry to cut you off, but based on what I've seen lately from Google Translate, it feels like less AI-complete than I would have thought. Spence: Yes. So that's the next comment that I'll make, which is that philosophical statement doesn't mean that within business settings, you should not be using it. And I'll give you an example. One space we've been looking at recently is crypto. Four months ago, nobody knew what a non-fungible token is. How do you translate that into Swahili and Korean? Well, an MT system is not going to give you the answer to that question, because language is productive. People are making new words all the time. Machines are not making up new words all the time, people are. Philosophically, you've got to have training data for the system to be able to produce a result. People do not need training data to do that. But then I think increasingly, there are a lot of business settings where it's good enough to solve the problem. If you go...for years, you can go to Airbnb and look at a flat and click translate with Google, and it'll give you a translation. It may not be perfect, but it's certainly enough to convince you you want to buy this, rent this flat. I think there will be more and more cases where fully automatic machine translation solves the business problem at hand. I think that's absolutely true. And then I think there's a third part of it, which is social and organizational, which is, ""How soon, VP of Marketing, are you willing to let raw machine translation go on your landing page with no oversight?"" One way to think about that is, how soon are you, Lukas, ready for a machine to respond to all of your email? Lukas: All of my own email? Spence: Yeah. Lukas: Well, I have to say- Spence: Some of it probably sure, but others, parts of it a little bit dangerous. Lukas: I mean, this might be an off-the-wall question, but I have noticed ... I think I have a slightly more polite writing style because of Google's predictive text algorithm. 
I wonder if you're slightly shaping the translations with your predictions, even if the translator is getting involved in making it match. Spence: Oh, yes, this is called priming. It's a common feature of psychological research. One of the things that we showed in grad school is when you show somebody a suggestion, they tend to generate a final answer that's closer to the suggestion than if they start from scratch. Lukas: I mean, I guess maybe it's better that I write slightly more politely. I don't know, maybe there's some good you can do with it. Spence: Well, it's pulling your writing down the mean behavior, a mean level of performance. So I'm not sure if that's great. Lukas: Pulling down or pulling up, I don't know. Spence: Yeah, or maybe it's pulling you up to a mean level of performance, right? Lukas: Do you think that the translators learn to use your system as well? Do you see productivity going up for an individual that's doing this? Spence: Yeah. We have an HCI team, and this is one of the main things that they're working on right now, which is, I think... I remember right when we started the company, one of my co-advisors, Jeff Heer, who started Trifacta, I was telling him — this was really early on, and I was showing him some of the stuff we were building and we want to optimize this, and we want to do that — and he said, ""Let me stop you right there."" In the early days of a company, you're just trying to make things less horrible than they are. You're going to be in that phase for a long time, before you get to the optimization phase. So I think for a lot of the last number of years, it was like catching up on neural MT, making the system faster multi-region, making the system more responsive in the browser, and there was just like a lot of un-breaking work that was going on. Now we've got some pretty convincing results that the thing that we really ought to focus on is how people use the system, that the greatest predictive variable of performance is just like the individual's identity. When we look at how people use it, there's really high variance and the degree to which they utilize the suggestions, how they use the different input devices on the keyboard, how they navigate and work through a document. So the team's spending quite a bit of time on user training right now actually. Lukas: So user training not like modifying the interface, but you're training people to- Spence: User training, yeah. Lukas: Interesting. Have you ever considered doing multiple suggestions? Is that possibly better or? Spence: Yeah. One of the reasons that this predictive approach to MT didn't work really well is because the interfaces that were built up until our work, they use a dropdown box. It turns out, when you put stuff on the screen people read it, which slows them down. So what you want to do is show them the one best prediction that's the very best prediction you can show them. Lukas: I see. Interesting. I bet that's especially true when you're confident in your predictions. Spence: Yeah. Lukas: Cool. Is there any other surprises in terms of your interfacing with humans? I feel like — my last company was a labeling company — it just had all these interesting ways that the interaction between humans and machines surprised me. Has the way that you engage with the human changed at all over the years that you're running this besides training? 
Spence: Maybe one of the biggest things that we learned is that, historically, within this translation world...I mentioned this MT work goes back to the '50s. And professional translation as a profession...I don't know, it predates agriculture or something. It's a really old profession, right? Lukas: Sure. Spence: So these people have been engaged with AI systems for 50 years, and for most of that period of time, the systems were really bad. There's a lot of bias against these systems, and people, especially those who used them for a while when they weren't really good, were reluctant to try them. I think more broadly now people are using them because MT is a lot better, but we found that resistance to change was really significant, and the way to get around that was to align incentives better with the business model. What do people actually want more than they want to not embrace machine learning? Well, they want to get paid, they want to be recognized for their work, they want to be appreciated, they want to have a good work environment and work with good people. We found that when you focus on those things and do them right, then people are really open to, ""Let me try this automation, I'm okay with the fact that you're changing the interface every week"" and all that stuff. Lukas: That makes sense. Is there a feedback loop with the ratings? I would think that might be an important thing too, if you're then rating the quality of the translation. Spence: Yeah. We just submitted a paper to EMNLP, hopefully it'll get in. We've been working on bilingual grammatical error correction. What the reviewers do, you can think of as another review step. So we took an English input, we generated some French, maybe there's some bugs in the French, and we give that to another person who then is going to find and fix those bugs, or maybe they make some stylistic changes, or who knows what they do. That just becomes another prediction problem with two inputs, the English and the- Lukas: Corrected input? Spence: -unverified French, or whatever you want to call it. Then we're going to predict the verified French. You can use a sequence prediction architecture for that; you can use sequence modeling for that. The team's been working on that for about the past year and a half, and they've got it working now. We announced that last fall, and we'll have it in production I think sometime in the second half of the year. Lukas: Wow, that's so cool. In production, what would that mean? Once you finish editing a document, it goes through and makes suggestions? Spence: Yeah. It's a fancy grammar checker. Only, it's a grammar checker that's data-oriented instead of based on rules, and it can learn things. It can learn simple phenomena like spelling mistakes, but it can also learn stylistic edits. Lukas: Well, it sounds like it's also incorporating the source language too, right? Spence: Yeah, so that's how it's different than Grammarly or the grammar checker that you have in Google Docs or whatever: instead of only having one language to look at, the string that you're generating is constrained by this other source language input. So you can't just generate anything. You've got this very strict constraint, which is the source language. Lukas: Do you plan to do a separate one for every single document stream or work stream that you have? Spence: Yes. Lukas: Wow. Spence: You can use the same infrastructure for that as you use for the translation. Lukas: That's so cool. 
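One plausible way to frame the two-input prediction problem Spence describes is to pack the source sentence and the unverified draft into a single encoder input for a standard seq2seq model. The sketch below shows that data-preparation step; the separator tags, field names, and example sentences are assumptions for illustration, not details from the paper.

```python
# Illustrative sketch (one common framing, not necessarily Lilt's): bilingual grammatical
# error correction as sequence prediction with two inputs, packed into one encoder string
# with separator tags. A standard seq2seq model would then be trained to emit the
# verified output.

def make_gec_example(source_en, draft_fr, verified_fr=None):
    """Build (input, target) strings for a seq2seq correction model."""
    model_input = f"<src> {source_en} <draft> {draft_fr}"
    return (model_input, verified_fr) if verified_fr is not None else model_input

# Hypothetical training example: an English source, an unverified French draft, and the
# reviewer-verified French that the model should learn to produce.
train_example = make_gec_example(
    source_en="The invoice is due on Friday.",
    draft_fr="La facture est due le vendredi .",
    verified_fr="La facture est due vendredi.",
)
print(train_example)
```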
Well, cool. So we always end with two questions that I want to give you a little time to chew on. One that's kind of open-ended, but where I'd be interested in your thoughts on MT specifically, is: what's an underrated aspect of machine learning or machine translation that you think people should pay more attention to, or that you'd be thinking about if you weren't working on Lilt? Spence: Maybe it's around the question that you posed earlier, which is the human parity question with translation. There was a paper, I don't know, two years ago; Microsoft had a paper saying ""Human parity has been achieved⁸"", and then two weeks ago, Google published a paper on arXiv saying ""Human parity has not been achieved⁹"". I think that in our application, there's a lot to translation quality, which is about the particular message that you're trying to deliver to an audience, and which has a lot to do with how the audience feels. And certainly in my time in grad school, I was really focused on just generating the output that matches the reference, so the BLEU score goes up and I can write a paper. I think there's a lot of interesting work in thinking about the broader pragmatic context of the language that's generated, and whether it's appropriate for the context that you're in and for the domain. That's really hard to evaluate, but it's really worth thinking about, whether it's in natural language generation, or machine translation, or whatever else. So maybe thinking about that a little bit harder is what I would spend some time on. Lukas: Yes, the BLEU score is funny, because it seems like such a sad metric for translation. It makes sense that it works, but it just seems so ludicrously simple. I mean, at some point, I feel like it must lose meaning as the best possible metric, right? Spence: Well, people studied it a lot, and I think the conclusion was that it's the least bad thing that we've come up with. Two decades of study, and it continued to be the least bad; nobody could come up with anything that was as convenient and correlated better with human judgment. So maybe it's a testament to a simple idea that people are still using 20 years later. Lukas: I guess simple metrics are better than complicated metrics. There might be a lesson for business in there. Spence: There might be a lesson there too, yeah. Lukas: But I guess the final question we always ask is, what's the biggest challenge of machine learning in the real world? I'd like to tailor it to you a bit: what's been the hardest part of getting these language models to work in production? You touched on it a bit, but I'd love to hear, especially, any part that might have been surprising to you as an academic, before starting the company. Where have the challenges been? Spence: If I think back to when we started the company, the research prototype that we had, you had to specialize it to one document. If you were going to translate a document, you had to compile this part of it, and then load it into a production system, and you could send it the document and it would translate it. If you sent it anything else, it basically wouldn't work. I remember when we raised money for the company, I told the investors, I was like, ""Yeah, we're going to take this prototype and have a production product in six weeks or something."" What actually happened is it took us nine months, and the problems we had to solve turned into an ACL paper¹⁰. You should not do this. This is very bad. 
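For readers who have not looked at BLEU in a while, here is a simplified, unsmoothed version of the metric discussed above: modified n-gram precision combined with a brevity penalty. Real evaluations typically use a standard implementation such as sacreBLEU; this sketch only shows the core arithmetic on a single sentence pair.

```python
# A simplified, illustrative BLEU computation (single hypothesis, single reference,
# no smoothing), to make concrete what the metric measures: geometric mean of modified
# n-gram precisions, scaled by a brevity penalty for short hypotheses.

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(ref) / len(hyp)))   # penalize short hypotheses
    return brevity * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))   # perfect match -> 1.0
print(bleu("the cat sat on a mat", "the cat sat on the mat"))     # one word off -> lower
```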
I think I really underestimated how far it is from a research prototype, that's actually a pretty effective system, to an MVP for something like what we do, which is taking any document from any company and generating a reasonable output. And doing that with like the learning turned on, and the inference and all that stuff.s Getting to a large-scale production system, which is probably not surprising to anybody who's worked in these production scale MT systems, but the amount of detailed large-scale engineering work that has to go into that was surprising to us, I think, even having worked on Google Translate. Lukas: Well, can you give me an example. What was something you ran into? Because it does seem like that shouldn't take nine months. What came up? Spence: Well, in those days in that original system, it was...you had to be able to load the entire bitext into memory. The systems stored words as atomic strings, and you had to have all the strings in memory to be able to generate a translation. We did a lot of work on what's called a compact translation model, where you can load the entire bitext into a running production node, and the lookups happen fast enough that you can generate an output. I think in the neural setting, what's been really challenging is you can't do batching. You can't just put it on a GPU or a TPU because [of] the latency constraint that you have. That's meant a lot of work on CPU inference on the way that production infrastructure swaps personalized models off, onto and off, of the production nodes. It seems, conceptually, really simple but when you actually get down into it, you're like, ""Wow, we've been at this for two months and we're still not quite there yet. What's happening?"" And that's sort of been our experience, I think. Lukas: Interesting. And I guess at the time, there was probably a lot less stuff to help you. Spence: Yes, there was no Kubernetes, there was none of that type of infrastructure. Lukas: Awesome. Well, thanks so much. This was really fun. And thanks for sharing so much about how your company operates. Spence: Yeah, it's always good to chat with you. Lukas: If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out. If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team. We're looking forward to meeting you.",7210 +Roger & DJ — The Rise of Big Data and CA's COVID-19 Response,https://www.youtube.com/watch?v=lDsmGBWO-OI,3893,2021-07-08,"DJ: What I think people don't realize is we are all in it to get better every day. We're sharing our skills. Whether it's sharing open source, techniques, ideas, technology, that's where it's coming together. In fact, if anything, I think, this terminology, this movement is a community-based organization, just as like Roger said, open source. No individual made this happen. The community owns this collectively. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. 
DJ and Roger are both good friends of mine, and have been working in data and ML for the last 20 to 30 years. Roger, was, for a long time, the co-chair of the Strata Conference, and VP of Research at O'Reilly. DJ was the Head of Data Science at LinkedIn, and the Chief Data Scientist under the Obama administration. They both recently worked on the California COVID response using data, and I could not be more excited to talk to them. All right, so I have a whole bunch of really good thoughtful questions, and then one question is probably slightly annoying for DJ, so I just thought I'd get it out of the way and start there. Which is, I was telling my wife, Noga, about talking to you this morning. I was saying, ""DJ, you're the person that came up with the term 'data scientist'"", and then my wife was like, ""No, no, that's not true."" Then we were discussing it, and then I was looking it up, and I couldn't figure it out. I was just wondering if you could let me know what the real... DJ: Sure. Lukas: I feel like at least you made the term popular, right? DJ: Yeah. I think the first part to call out here is Roger actually gets credit for making ""big data"" popular versus when people were talking about data. Roger gets credit for ""big data"". I think the part about... Most people don't realize that. Roger never talks about it, but he's the guy. I remember going to an early talk of Roger's, where he's like ""big data"". I'm like, ""Who's talking about big data? Isn't all data big?"" He laid out an argument for it, and I was like, ""Oh, yeah."" Then you saw it catch fire afterwards. Roger, you should talk about big data, but I'm happy to talk about where I think the origin story of data science comes from. Lukas: Let's start with that, and then I want to hear about the origin story of big data, and then what I should trademark and what domain names I should buy. DJ: Totally. Part of the thing of data people, especially in that early era of LinkedIn and Facebook and others, was that there was a community starting to form of people getting together, and what do you call themselves. People had many different versions of names that were going on. Even going back to the '60s, there's been arguments of people where they found documentation and people titling things ""data science"". It's been floating around, and I wouldn't be surprised if we find a lot more examples of what people had been calling data science. What was also going on at the same time, as people were trying to figure this out, people were playing around with the terms like analytic scientists, and Jonathan Goldman was the guy who came up with that. Pete Skomoroch was talking about the idea of a data artist. That actually got raised at a board meeting at LinkedIn. It was like, ""Are we painting a palette? Are we creating a palette with that?"" What I recall is, and what we put in our book, was when we were getting ready for the IPO for LinkedIn, Facebook was... Jeff Hammerbacher and I both got together, and we were like, ""Hey, HR is breathing down both of our necks. What do we call people?"" We had too many different job titles, and so it was like, ""Well, what's the listing?"" What actually went through those is you start to think about the terms. ""Analysts"" felt a little too Wall Street. 
""Research scientists"" was a title that Yahoo had really popularized for where the data scientists sat with Cameron Marlow and other people, but they were always pushed out to the side of the product process or the product engineering process, and so that was a little too researchy. If you go with some of the things more ""Statistician"" or ""Economics"" or any of those, you're creating a war right off the bat, but also, the term hadn't really quite caught on except for places like Google with Hal Varian's team. What we did is we went through that list. Jeff actually was the one who was like, ""Well, we're starting to think about this term 'data scientist'."" I took it back to... It was like, ""Well, that seems plenty reasonable."" I took it back to the team, and Monica Rogati actually had the idea of saying, ""Well, we're LinkedIn. We have all the job postings. Let's post all the jobs with different titles, and see what everyone applies to."" So, we did that. Monica actually constructed the test for it, and guess what, every one we hired was in the term ""Data scientist"", and so that's why it sticks. I think a lot of people have gotten caught up in this origin story, but I think there's two parts that are important. One, it exemplifies that this was a team effort. It's very easy for people to say, ""Oh, DJ and Jeff did this."" It's a community wide thing, right? This is a broad, diverse community that was all coming together to make this happen. The second is, why did it take off? Not only did we data science our way into this title, but the reason I think it takes off is because no one knows what the hell it means. I say that with great seriousness because... Roger knows this. As you watch these fields evolve, and you've seen this, Lukas, tremendous amounts through all your work over years, is people like to put people in boxes. They like to put skill sets in the boxes, and there's like, ""Oh, if you're doing data, you're not supposed to do product. If you're doing product, you're not supposed to do engineering."" We're like, ""Why can't we do it all if we've got the skill set?"" The data scientist person, people are like, ""They're smart, and they have superpowers. We don't understand them, but they really add value."" If you pull on that string of why do they add value, the reason funnily is because they're allowed in the room, and they have context. Once you have context, you can take your skills, and apply it to the problem faster than other people can. It's the ambiguity that has come out. I think that has led to the rise of the title being actually taken over. If you'd asked Jeff or me back then, I'm very confident I would have said ""No way this is going to be a thing that sticks. This is going to be really something that we really label our teams."" I think part of the reason it also took off... Frankly, LinkedIn and Facebook were very successful in their IPOs. People said, ""What's behind that?"" People said, ""Ah, Roger's term 'big data', and the whole thing that makes big data come alive are these data scientists."" I think that in my view is how we should think about where this is coming from and where... It also gives us indication of where we need to go. Lukas: Well, I guess there must have been something new going on with the social network companies of, I guess the mid to late aughts that there was a new need or something. I mean, I run, I think, a pretty standard business model at Weights & Biases, and it's really hard to imagine operating without a data science team. 
There must have been some kind of function before that. What changed in the requirements that it was needed to make a new role that didn't exist before? DJ: Well, let me lay out an argument from the late '90s, and then Roger should dovetail because he's seen the whole evolution of this. What I think had happened... This really started just around 9/11 time period, as people were like, ""Wait a second, there's signal in the noise, but no one's actually able to capitalize on it. How do we find the signal? How do we do something with it?"" You did see a lot of the early e-commerce companies like eBay and others actually had the equivalent of ""data science team"". They were just analytics functions in those roles, and people were called Business Analysts or other different titles. Google had a lot of these people, and had a lot of impact. I think the seminal difference that we saw, which was really building on Yahoo's research team and those kind of groups, is that the data team could actually build products, not just come up with insights. At LinkedIn, the data science team had one part, which is, ""Hey, how are we doing?"" Metrics, dashboards, all of that thing. Had another component, which is ""You're responsible for revenue. You're responsible for engagement. Your responsibility is to build things, make stuff happen."" You're a design team, a product team, an engineering team. All that comes together, and then there was another part, which is you gotta open up new turf and help things in new ways. That looks like security. Because if you're going to fight bad guys who got super sophisticated data tools, the only way to survive that is by bringing increased data science and functionality actually to bear to that. Roger, let me hand it back to you. Roger: Sure. I think what changed, and it was right around the time frame you're talking about is that suddenly, there were little companies with big data. They weren't going to go out and buy Oracle or... I was at Sybase at '99. They weren't going to go to Sybase. They needed to come up with their own thing, and the primary thing I think those big social media companies had to do was write quickly, and that meant in a distributed fashion. Then you've got Jeff Dean doing MapReduce¹ at Google to start that going, and then you've got the Yahoo people taking that idea and making it out. What's interesting is that it dovetails with open source becoming mainstream, because now you've got people who are willing to use open source because their company is banking on it. I remember talking to Abdur Chowdhury, who was the Chief Data Scientist at Twitter. This might have been 2004 or 2005. He's like, ""I wouldn't use Oracle, because I can't go into the code and fix it, if there's a problem"", and then that became a really important thing... I also think that that was the era when the best engineers in the world were really centering on the Bay Area. That's where, in some ways, when I wrote the essay about big data and stuff, I was doing those talks while using big data. I was trying to capture this notion of having to store a lot of data, do it in a distributed way, analyzing big masses of data instead of a little or medium-sized pieces of data, and this became more core of what companies were doing. I tried to get that all in one thing. That's the way DJ and I met is, I was writing a journal article with Ben Lorica on big data², and I knew Jonathan Goldman. He said, ""You should come and talk to us"", and so we did. 
We really liked the way DJ's team had people arrayed all across these functions that we used to think were in separate pieces. He mentioned the product piece. He had visualization people, and they were all kind of together. We're like, ""Not only is it big data, it's also a big group of people"" with these multiple functions that ended up being worth integrating and worth coordinating with. I think it was a big thing, because I've been doing data warehousing in the late '90s, and that was a siloed thing as imaginable in most companies. It was not part of the mainstream. I think what happened is, all of a sudden, you had LinkedIn, Facebook, Google, where that was what the company did, capture a lot of data and try to make sense of it to, in some ways, improve what they're doing, and in some ways to monetize what they were doing. It's a lot of incentive. And it was just driven in a whole different direction because of the open source piece of it too. I'll just add one thing about big data. There's one personal part, and then one other part. The personal part is I've worked at home for a long time, and I used to often bike to go daily shop. It's where I go get some things. Once every other 10 days or so, I had to do a big shop. I think this is just a verbal tic that I use, that there's little and big things. The other thing is I got access to SimplyHired's data, and it was huge for me. It was two terabytes, and two billion rows, and I needed help. I got introduced to Scott Yara at Greenplum. We started doing that. I know the first talk I gave where I officially use ""big data"" was around that, distributed data management and doing that. I guess the last thing I should say is I was at O'Reilly, which is famously meme generating³ as a company. That helped. Lukas: Incredibly successful. Roger: Right. I had a platform that people actually listen to. DJ: The part of there that, I think, hopefully people are also taking away is, this has been a very big tent phenomena. Abdur and Scott Yara, I mean, Lukas, you, all of us work together. People don't realize when we were first comparing these ideas of how do we use Mechanical Turk, we actually... People don't probably don't realize, we ran a test head to head with each other of like, ""How could my team do it versus you?"" We learned a lot more from each other. We ended up going with you and using you, but what I think people don't realize is we're all in it to get better every day. We're sharing our skills. Whether it's sharing open source, techniques, ideas, technology, that's where it's coming together. In fact, if anything, I think, this terminology, this movement is a community-based organization, just as like Roger said, open source. No individual made this happen. The community owns this collectively. Roger: I'll bring up an interesting adjunct to that. I don't know when it was. It's probably around 2010, 2011, but at the time, there was MapReduce at Google, and then there was Hadoop starting to make a lot of waves out in the world. The people I know at Google were very much in support of Hadoop. I think people were evolving. They're thinking about open source. The reason Google was so in support of Hadoop is that if you've learned MapReduce on Hadoop, they could hire you, that it was a way of training people. I think now, open source is a different dynamic on why people do it. But back then, that ended up being an important dynamic. I think, when you're ready to ML, where everything is open source now, is that that's the logical thing. 
ML tools are cool. They do a lot of stuff. They're great, but what you really need is people. The more people you get involved, the more likely these things are going to get traction and become part of the mainstream. That is why PyTorch and TensorFlow and those kind of things are, I think, in the public domain with an open source way or shareable, because what's really more important is what you do with them than the tools themselves. Lukas: It's funny. Not to turn this into a whole reminiscing session, but DJ, I remember right after meeting with you back in... I think you recently left eBay. I remember I got a meeting with the eBay CTO, which was a huge deal for me at the time, because we were selling data products. I have this vivid memory of him telling me that he couldn't possibly store all of the user data. He basically erased 99.9% of it, and just saved the little bits of the rest of it, because that's all you needed to do anything important. I remember thinking like, ""Wow, that seems so painful to erase that data. You might want that data,"" but it's funny, because now, I feel like no company would dream of erasing data. It makes me wonder how much of all this is just driven by the ability to actually store all of this data . DJ: Actually, what people don't always realize, this is one of the reasons I actually moved on from eBay, the straw that broke the camel's back. eBay obviously recovered from this, but there was a big argument from a number of us that said, ""Hey, every time we want to do something interesting, we have to go to the lords of the data warehouse, and ask permission."" To get something done took months, and it should... It was pretty easy, obvious stuff. One of the things that... I remember this meeting very clearly is a number of us had this technical session. We basically said like, ""Look, the bet for us has to be Hadoop. There's no other way. We cannot sit on traditional infrastructure and do the problems that we need to compete. It's business critical."" Those ideas got pushed out, and effectively, all those people that were on that mission of doing it all left to other things. Chris Riccomini, one of the key people behind Kafka, was one of them as well. What it showed is... I think this is something that companies need to grok with respect to machine learning is that there are paradigm jumping moments. If you don't jump, you will have to jump later, but you're going to be so far behind the curve. eBay obviously adopted one of the biggest Hadoop clusters with Cloudera. Seven years later, five, seven years later? But they could have been so much more competitive and done so much more. It strikes me that there's a similar moment that is happening around machine learning that if you don't get on the bandwagon now, you're late, if not already late. Lukas: Interesting. I mean, of course, I would agree with that. I actually had a question. I had a question written up for you, Roger, that's maybe poorly formed, but I feel like you had front row seats. I'm not even sure if it's completely true, but it feels like there was this massive shift from Hadoop to Spark maybe five or six years ago, and it seemed like it was slow. Then all of a sudden... I was wondering. It seems like you just really saw that. I was wondering what you think about that and if there's some fundamental problem with Hadoop that they could have fixed, or if there's something coming beyond Spark. I was really curious to get your take on that. 
Roger: I actually have strong opinions on this, but they'd be easy to try to puncture holes in. I think Hadoop was a write engine, and people needed a read engine. The fundamental early problem was the one you guys just talked about with eBay: how do you get all this stuff to disks? Well, distributed was the way to do it, but it was slow to get stuff out, and things like having no schema end up not being very good if you're trying to do any analytics on it. I remember I got into a lot of arguments with people, where they were telling me MapReduce is the way everyone's going to work. I'm like, ""There's just no way that that's going to be the case. There's just too much embedded SQL. SQL is very productive."" In a place like the Bay Area with its high concentration of great engineers, a lot of people were getting MapReduce, but a lot of people weren't getting MapReduce. Back when I was first learning, I had a lucky thing going, because I was working with Greenplum. Joe Hellerstein used to come to my home, and we'd go through MapReduce problems as they were trying to put in a MapReduce part of it, but my sense was that it was going to be SQL that was going to win, and that instead of just storing stuff, which is like step one, it's the analytic support that really mattered. Spark was just better at that than Hadoop was. I think Impala was the first time that SQL was really available. Spark came right away with SQL. The other thing that happened, and this is just kind of an anomaly of... not anomaly, but just one of those harmonic convergences, is that Python was starting to become just a de facto language right around when Spark had a Python binding. That meant a lot more people were just able to get in and do the work that made sense. Also, part of it was Spark being in memory. It was just fast. As long as you were able to get your RDDs, and eventually the more table-like things, into memory, you could run really fast queries. I think when it comes down to all this, you'd mentioned the kind of disconnect, DJ, between your getting data at eBay and having to wait for it. I was running the equivalent organization at Sybase. I had the data engineers and data scientists right together. Of course, they weren't called data scientists. I ran the group. I did both things. The reason we did that is so that no one could complain that it took months to get anything, because I wanted to keep everything tight together. Go forward to when Spark started coming out, and you were able to actually do data engineering and data science work all in the same platform. You didn't need someone to pull all this stuff for you. You could do it yourself, because it was SQL, and you could just pull it in, and go through the whole thing up to even early ML stuff at the time. DJ: The other thing is Spark is cheap. One of the things that people... The eBay team had some amazing technologists. They're all that Sun generation of deep, deep infrastructure thinkers, and so they used to have a TIBCO bus, and they had basically stream processors sitting on top of it. Except it was very expensive, just because of the structural constraints of the time. With Spark, one of the things that was beautiful about it that we saw is like, ""Wait, we can have a stream processor finally? We can actually do computation without having to wait and without sitting behind all the ETL pain?"" That gave us a massive leg up on a key set of problems, mostly ones that were time-bound, like fraud and security issues. 
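A tiny PySpark example makes Roger's point concrete: with the Python binding and Spark SQL, the same person can load and shape data and then run the analytics query in one place. The data, column names, and query here are made up for illustration; a real job would read from files or a warehouse rather than an in-memory list.

```python
# Illustrative sketch: Spark's Python binding plus Spark SQL in a single small script.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Toy in-memory data standing in for something you'd normally read with spark.read.csv(...)
events = spark.createDataFrame(
    [("alice", "click", 3), ("bob", "click", 1), ("alice", "purchase", 2)],
    ["user", "event", "qty"],
)

events.createOrReplaceTempView("events")           # expose the DataFrame to SQL
spark.sql("""
    SELECT user, SUM(qty) AS total_events
    FROM events
    GROUP BY user
    ORDER BY total_events DESC
""").show()

spark.stop()
```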
That was natural to gravitate to versus the Hadoop frameworks, the MapReduce frameworks. The other part is I think Roger's pointing out, which I still think is there, is a lot of people want to work. We saw this for Kafka also. It's like, ""We were going to put the logic layer on there,"" but it just takes so much time of development, even with the open source community to graduate these things, and Spark didn't have to worry about the underlying buzz. The part there that I think that we're seeing is data has moved into a space of just the background view. You've got specialized tools, right? For depending on the team, you're going to need different things, because most people who work in MapReduce, that is a leap way too far for most individuals and teams, especially when you're bringing in fresh talent from other disciplines or other areas. Roger: I just want to bring up one...This is a bit of a corollary to this stuff, but when things first started — I know Lukas, we haven't even brought up Math Club yet, Math Club in San Francisco — diversity, cognitive, physical background, all this, is something that leads to really a lot better outcomes. I think that that's at tension with things like MapReduce, which are exclusionary and are really geared towards people who are really technically adept, that the companies that are really going to do well are the ones who can bring the tools out. I'm not talking about just democratizing data, because I've got a really clear issue around too much democratizing data, but getting people who can go into the data and figure things out and having a lot of different perspectives on that is really going to make a big difference. I think that that means having tools that more people can use to get there matters. I think when you look back at the aughts to maybe the early tens, that they were still pretty hard to do, and that now we're at a place where a lot of people can spin something up, and start to make sense of it from all sorts of different backgrounds. Lukas: That's a great point. I wanted to go back, Roger, to an earlier point that you made before I forget, which is you made a little bit of a... I don't know. It seemed like you're a little bit dismissive of NoSQL databases, which is ironic, because I learned about NoSQL databases in your Math Club, and I've continued to use them for the last 20 plus years since then. I was actually curious, do you think that it's generally a bad idea to... I mean, of course, everyone uses them now for some functions. Roger: No, I don't think they're a bad idea at all. I think they were not a replacement. They're good for what they were. I think that the main argument about schema-less was...that was a really terrible argument. If you want to make sense of data, you probably want to have it organized in a way that people know. When you think about analytics, it's a combination of things, the combination of data, the tools you're using, and the person who's using the data. The more that the person can know about the data, the less cognitive load on them to get into it, the better they're going to do. Having to deal with different schemas is not a way to promote that. In fact, what you end up promoting is someone who's got this photographic memory, rather than just a broad memory. I mean, it's like the way JSON is clearly the way that most people move data around. 
I'd much prefer getting CSV data, because it's organized in a way, and you're not going to have the overhead of tags and stuff that are telling you what everything is. You can move right into... What usually I want to do with the data is trying to make some sense out of it. I'm only dismissive in it as a pure replacement. It's like a lot of things. When it's the right tool, like you've got a lot of text. I know SQL database is great, but for plenty of things, I want a key. I want a primary key to... I mean, I have something to dedupe against. I just ran a big deduping project for the state of California around Homebase's timecard data, and they gave me stuff, and I had to dedupe it. It took 19 steps to dedupe it to try to make a primary key that I could use, and pick the right one when I had multiples and stuff. I think this stuff matters. I think that... I'd love to hear someone argue the other point, but that you end up with things kind of messy, which was maybe okay, but you end up having to build taxonomies and the kinds of things that help you make sense of data. They end up looking a lot more like tables than a schema-less thing. Lukas: DJ, do you have thoughts here? DJ: I'm with Roger. I think one of the things that we've seen with a tool that is being used for many other things, you end up building a lot of scaffolding or process around it that then suddenly is like, ""Hey, there's data dictionaries for this, and there's manuals and wikis to help you get through the schema-less world, and you're just like, ""Did we just put a schema structure that's just meta around this?"" Roger and I both had the opportunity and good fortune of working in California on the COVID response. There's a lot of really dumb, boring, unsexy problems that are the real rate limiter of progress. People are very apt to saying, ""Oh, there's another data source. We'll grab it. We'll put it in,"" and then you ask ""How many people are actually ever looking at it?"" It's zero. You go around, and then you say, you look at the requests that people have, and you're like, ""Everyone's requesting this data. How come no one's looking at it?"" You go, ""Oh, this is actually a comms accessibility problem. We're trying to solve this with all this machinery and everything else."" Literally, in the California COVID response, do you know the thing that changed the game? Myself and three other state people, we wrote a data dictionary in Excel for like, ""Here's all the data that we have."" We just sent it around to all the different departments. We're like, ""This is what we have. Here's where it is. If you see something new, or you want something, here's the new process. We're going to go super old school, and you can print this out. You can share it. Here's all the data that you want."" People can flip through it and be like, ""Oh, I need COVID case counts by this. Oh, great. It's already in there. Sweet, ready to rock and roll."" Those things move the needle more than just having this brand name data warehouse or super other cool stuff or dashboards up the wazoo, because they don't get looked at or utilized. I think one of the things that, I think, I find myself saying a lot is ""What problem are we trying to solve, and does this actually solve the problem?"" I suspect this is true for all of us. We've been in plenty of times where people are like, ""The problem you're trying to solve is not the problem you really have."" Roger: Mmm-hmm (affirmative). DJ: Go for it. 
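Roger's deduping project a moment ago is a concrete example of the unglamorous work he and DJ are describing. Purely as an illustration (the column names and the tie-breaking rule are invented, and the real project reportedly took 19 steps, not three), building a usable primary key out of messy records often looks something like this in pandas:

```python
import pandas as pd

# Hypothetical timecard extract; every column name here is invented for illustration.
df = pd.read_csv('timecards.csv')

# Step 1: normalize the fields that will form the key.
for col in ['employer_id', 'employee_id', 'shift_date']:
    df[col] = df[col].astype(str).str.strip().str.lower()

# Step 2: build a candidate primary key from the normalized fields.
df['candidate_key'] = df['employer_id'] + '|' + df['employee_id'] + '|' + df['shift_date']

# Step 3: when the same key appears more than once, keep the most recently updated row.
deduped = (df.sort_values('last_updated')
             .drop_duplicates(subset='candidate_key', keep='last'))
```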
Roger: I have this thing where I often tell people, ""What does paradise look like?"" That's the question I ask. Then they give me... paradise isn't clouds and harp playing, but the business plan they're trying to solve. Then they go, ""Okay, how do I step through? How do I get there?"" Then that process leads into what kind of data you might use. As you were saying that, DJ, about the data dictionary that you did, I mean, I think that's really important. I think there's some... If we want to get into this, there are some fundamentals that people forgot about, but I think are worth reiterating to put things in a more productive manner. But at the bottom of this list I prepared for this is ""Put human perspective first."" Maybe I should have made that the top thing, because I think what it ends up is we start thinking about the math and all that and biases and everything that's part of this, but it ends up... It's really a human process, and what you're really trying to do is get humans to give them the cognitive capacity to make better decisions, or at least to make decisions that are informed in a way that they can then learn from what they've done, and move forward from there. Lukas: Well, I want to hear this list of best practices, but that reminds me of one of my favorite memories of you, Roger, which I don't think you... I don't know if you remember this. I don't know if it made such an impact of you, but I was late to meet you at our first office for Weights & Biases when it was six people. I remember you were like telling our Head of Product, Carey, you're telling her, ""Basically, nobody wants data visualizations. They want insights."" It's funny, because our tool is mainly of...data visualization tools is one way of looking at it. She was nodding in agreement. I was like... She was thinking about taking all the graphs out of our products. I was like, ""Oh my god, I'm five minutes late."" This is already happening. Roger: I remember that. That actually is on my note. I think we were just hanging around, and trying to make a point. One of the points is when you've got KPIs, when you've got someone who's in the data every day, and they know what they're looking for, you need a dashboard. You need this kind of visualization. But when you want to communicate, and I liked when DJ used the word ""comms"", you need narration, you need annotation. A dashboard won't do it. I can just give one example from the state. They had mobility patterns for every county in California. There's 58 counties. There's 58 little charts arrayed in a lattice. Alpine County in California has 1,200 people. There's high schools bigger than that, and that showed as big as Los Angeles, which is the second biggest county in the country. That was not telling a story. That was just going to confuse people. The point I was trying to make when I was at your office was more around that, that you need to include the things around narration and annotation. Again, bringing the human part in so that you can make sense of it and to show what's going on. If you've ever seen me give a recent presentation, I use the lipstick mode, and I put big red circles around the things I want you to pay attention to. Then the slide appears with ""Tada"", then that comes on, and then I say it so that I'm trying to peg it a little bit into your memory. DJ: I think this gets actually to a pet peeve. Roger and I were talking about this some time ago, which is my biggest pet peeve... Roger, I'm curious your reaction to this. 
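Roger's point about annotation and narration, the "big red circles," can be shown with a toy version of the 58-county lattice. The sketch below uses randomly generated data and an invented county name; it is only meant to illustrate muting the other series and annotating the one the story is about, not to reproduce the state's actual charts:

```python
# Toy example: draw all 58 series in a muted color, highlight one, and annotate it directly.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
days = np.arange(90)
counties = {f'county_{i}': rng.poisson(5, size=90).cumsum() for i in range(58)}
highlight = 'county_37'   # stand-in for the county the narrative is actually about

fig, ax = plt.subplots()
for name, series in counties.items():
    if name != highlight:
        ax.plot(days, series, color='lightgray', linewidth=0.8)
ax.plot(days, counties[highlight], color='crimson', linewidth=2)
ax.annotate('this is the one to look at',
            xy=(days[-1], counties[highlight][-1]),
            xytext=(-150, 15), textcoords='offset points',
            arrowprops={'arrowstyle': '->'})
ax.set_xlabel('days')
ax.set_ylabel('cumulative cases (synthetic)')
plt.show()
```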
Somebody's like, ""As you can see. I'm like, ""I have no idea what you're talking about. There's 58 lines. What are you talking about?"" Then they're like, ""As the graph shows,"" and you're like, ""I don't know what that means even."" People love these things, and you're just like, ""Where the hell is...Tell me..."" We have a saying in National Security, BLUF, Bottom Line Up Front. Lukas: Nice. DJ: Tell me what the bottom line up front is, and then I can get there. But if you're taking me on this journey of literature, I don't have time for that crap. Help me understand. Can you go to the president and be like, ""Well, let's go on a data journey together, and let's talk about how we got here."" No, bottom line up front, and then figure out if they're interested, how do you get them to the richness that helps get another level of understanding? Roger: There's a thing that I always tell the analysts who work for me that you can at best communicate four things plus or minus three. You got one to seven things that you can try to communicate, and you should say them upfront. The BLUF-ing, you can go through it, and then say them again at the end, and save the detail for later. I think what's hard for a lot of analysts is that for them, the story of how they got to where they got is pretty interesting to them. It's really the insight that you really need to... DJ: ""I joined this table, and then I did this"", and then everyone's like, ""Oh my God."" Roger: They'll say something like, ""And I forgot to do a left join."" DJ: We don't want the GitHub repo. Roger: That's what appendices are made for. You throw that stuff in there. Lukas: I wanted to ask you about... I mean, you've both mentioned your work with the COVID response. I was thinking about COVID, and I feel like it's maybe the first time in my life that I feel like I've really consumed data visualizations from my government. It does seem like starting to kind of get this communication in graphs and charts that are reasonably good and seemed to be well thought out, but I was wondering what problems you were trying to solve, or what were the big problems that data could solve with COVID and our government? DJ: Well, I mean, maybe I'll start, and then Roger, you want to layer in because you picked up a lot of the baton from me in our first wave, and then took it far further. The way this happened is our intention wasn't to go up to the Capital, and just be like, ""Look, we're here."" In fact, what it was, was we just happened to be on a phone call with a friend who is helping out at California, was actually a state employee. We're talking and they said, ""Here's what we're thinking about data."" As I remember saying, ""Well, that's not what I would do if I were you."" They're like, ""Well, what would you do?"" It's like, ""Famous last words."" In a couple hours, I wrote a memo. I said, ""Here's the way I would frame it. Here's what I think is doable. Here's what's not, and here's how I would organize things."" Next thing I know, 24 hours later, we were driving up to Sacramento at 5:00 am to meet the team and start jumping in, and then we were up there for about 100 days. The first part of it was, remember at this time period, there was no data. People think there was lots of data. We had data that we weren't sure we could trust out of Wuhan. We had data off two cruise ships, and a little bit of information when we were able to call our friends who could connect us to other friends who were physicians in northern Italy. 
That's all the data we had. There was all this talk [about] epidemiological models, all these things. There were no models that were like, ""This is the gold standard."" There's no weather model for this. We were able, luckily enough, to have... The story, actually, interestingly enough, is detailed in Michael Lewis's book that was released today⁴. Then we had this amazing woman, Charity Dean, who was working on things. We had another guy, Adam Readhead, who is another public health official, and Amy Tong, who was running Information Technology for the state of California, amazing human. These are real...these are people we should be grateful for. What they were putting together was like, ""Well, what is the model?"" We found that one of the models... Everyone was looking at the models for all of the states. The model for Delaware is the same as the model for California. That doesn't make sense, and that doesn't help us think about LA and San Francisco versus Alpine County or the Tahoe area. We needed a more sophisticated one. Luckily, there was a research scientist named Justin Lessler out of Johns Hopkins, who had a pretty sophisticated model. This model- Lukas: I'm sorry, this is a model of what? What are you modeling? DJ: This is modeling... It's basically a set of differential equations, and it says basically, a person's... You start with the population. You sprinkle some base conditions of those that are infected. They, at some percentage, get other people to be infected, with symptomatic and asymptomatic ratios. Some portion of them will die. Some portion of them will survive, and then that's it, super simple. Now, you need other things in there like, ""Well, what about people who commute between the Bay Area and LA? What about different age demographics? What about closing schools? What will that do?"" They had started to build more and more sophistication into the model, so you could run an ensemble, many, many scenarios. The only problem is this was a research thing, so it was running under somebody's desk. Luckily, we were able to call on Sam Shah, who really deserves a lot of the credit for scaling People You May Know at LinkedIn. Jonathan Goldman and Mike Greenfield really came up with the ideas. Sam Shah scaled it and made it really a machine learning platform. Josh Wills, who was at Cloudera and then at Slack, figured out how to make data engineering work. The two of them, with Justin Lessler's team and with massive help from Werner Vogels and the Amazon team, took that model and ported it over in a matter of days. Now, we were able to run hundreds of simulations. Those simulations are what led to those first graphs that people saw of the exponential curve that were shown at press conferences by Governor Gavin Newsom. That also led everyone to see like, ""Holy cow, if we don't get this under control, here's where our bed capacity is right now. Here's our bed capacity if we put them in parking lots and do everything, and here's the curve."" A lot of people at that time were like, ""This is garbage."" They didn't see what's happening literally in India right now. They weren't seeing what was happening in New York. That model, that effort of a combination of data scientists, data experts, and technologists, combined together with the policymakers, that's what led to the state order saying we need to stay at home, because there's one goal. One goal is to preserve the healthcare system for tomorrow, because if the physicians get sick or die, you don't get that back.
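DJ's description, a set of differential equations where some of the population is infected, infects others at some rate, and then recovers or dies, is essentially a compartmental epidemic model. The toy below is a generic SIR-style sketch, not the Johns Hopkins model he is talking about, and the parameter values are made up purely to show the shape of the computation:

```python
# Toy SIR-style compartmental model; parameters are illustrative, not the real model.
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    s, i, r = y
    new_infections = beta * s * i   # infected people infect susceptible people at some rate
    recoveries = gamma * i          # some portion recover (or die) each day
    return [-new_infections, new_infections - recoveries, recoveries]

y0 = [0.999, 0.001, 0.0]            # 'sprinkle some base conditions of those that are infected'
beta, gamma = 0.3, 0.1              # transmission and recovery rates (made up)

solution = solve_ivp(sir, (0, 180), y0, args=(beta, gamma), t_eval=np.arange(0, 180))
peak_day = solution.t[np.argmax(solution.y[1])]
print(f'peak of infections around day {peak_day:.0f}')
```

The sophistication DJ mentions, commuting between regions, age demographics, school closures, amounts to adding more compartments and parameters to this basic structure, and running many scenarios over the resulting ensemble.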
That model, those efforts with Governor Newsom, are what allowed other states to realize that they needed to take action as well. That's what led us to the follow-on orders, and actually being able to make sure that what we saw happen in New York didn't happen in San Francisco or in LA, even though LA was still hammered. From those efforts — and there was so much more data than we sort of realized — that was just one part, because then it was like, ""Hey, how do we help policymakers with richer, deeper understanding of ideas?"" We had to bring data in and draw insights. Luckily for me, one of those people who answered the call on the first ring was Roger, and Roger said, ""I'd be happy to volunteer."" We were all volunteers, no one was paid. This was all volunteer all the time, and we were all just trying to do our best. Roger, you should take over because you led the next portion. Roger: One of the interesting things that happened is what came out of that model was the need to look at mobility data, and so we started getting a lot of updates about how people were moving around. We noticed some things about that, and I think this is what's so interesting about it: the data led to thinking a lot about ethnography. It ended up being behavior that mattered, and then turning that behavior into something we could do. I'll just give one quick example. In the spring in Los Angeles, there is a bioluminescent event⁵, so these algae glow, and people want to go to the beach and look at them. People brought this up, and we're like, ""Well, what are we going to do about it?"" Somebody's like, ""We gotta keep people off the beaches. We gotta keep people off the beaches."" It's like at that time... This is April, I think like late April, early May. I think enough people on the team knew it was an aerosol, and that spreading apart was okay. I mean, but this is my remembering. It might not have been as clear as it seems now that we know for sure it's an aerosol, but that's how I remember it. One of the things we did is like, ""No, we're not going to keep people off the beaches. Let's keep people safe on the beaches."" Here's what we had. We knew that people were moving around a little bit. I think at that time, there was a little bit of upward movement, and so we told people in Los Angeles that what you need to do is maybe have some people to keep people spread apart and stuff. Of course, there weren't going to be as many people anyway that were doing that. We started doing stuff like preparing for Thanksgiving in August. What do you tell people? The harvest was a big... The real boost in California's rates came because of the harvest, which was a perfect equation for how you're going to get infections in a community. I don't know if you remember it, I think Imperial County at one time had the highest rates in the world, and it was totally because of the harvest. What we ended up doing is using data to try to communicate to the people in the state, and to think about behavior things that then we would maybe build new...and they really weren't models, they were really just studies about what we could do or what was happening that we could intervene better. We started going from where everything was statewide, at first, to talking about rural versus urban, because there are very different things going on depending on the density and characteristics, and trying to also learn things like... The Bay Area did better than the rest of the state.
I think that the trivial reasons why really ended up not being the reasons why. The trivial reasons were that people could work remotely more easily, and an educated population, and it ends up... This isn't something we found, but a lot of it had to do with the Bay Area's experience with AIDS, and having to deal with another pandemic. That was community, community access. As we started seeing mobility data showing more movement, we brought in someone from New Haven who had done some really interesting work around, ""How do you deal with that community part?"" What I like to think of as what happened is the basis was laid with data, and then we were using that to go to this next level of mixing that data with some qualitative behavior. We had some ethnographers on the team start doing a lot of surveys, and we would use those surveys to... In the end, DJ, I don't know how much you were involved towards the end of this, but really, the surveys ended up being the driving force behind the comms that were going out afterwards, which is, again, another qualitative thing, but we made sure that the surveys we were doing were a better instrument for pulling stuff. This is one of those lucky breaks. I happened to run all the surveys at O'Reilly, so I had some survey experience, and we were able to bring that in and improve that. DJ: Kara DeFrias really deserves credit for having the survey idea. What she did is she basically convinced the state to put an open-ended set of questions on one of the highly trafficked webpages, and it just sampled. It was just a way of getting feedback, but the problem is it's very hard to get feedback on a state the size of California, just given the disparity. One of the things that was prepared every night at that point was basically a briefing book for the governor and the key staff. It had charts and graphs and all sorts of really important key insights. But also, it had snippets of key things that we heard from the real population, real stories. These weren't data points anymore. They were people. They had names. They had ages. They had stories, and you read those things, and you could feel the fear. You could feel the pain, and so no longer could you just be like, ""Well, it's an uptick. We'll see what happens."" No, that uptick destroyed a family. Like, ""Oh, it's just the harvest."" No, we're about to destroy a community. What are we going to do about it? It changes the whole narrative and approach you take from just being a data science thing, and thinking about this in the abstract and playing with graphs, to, ""If we do not act right now with an immediate sense of urgency, somebody will die."" It's not an ""if"", someone will die. Our actions directly help shift the balance of who that is, and how do we make sure that they get the best shot at surviving? Our job fundamentally was to use data to give everyone a shot at living. If a hospital doesn't have oxygen, figure out how to get them the oxygen, so those people have a chance. If people don't realize that the people around them are highly infected, let's give them a shot to actually be able to take safety measures into their own hands, so they can survive and increase the probability that they are okay, or some other family member that they may expose will be okay. Roger: I just want to say one thing that was really striking to me about it. Obviously, this kind of thing is in politics, it's in the realm of politics.
This group of volunteers went out of their way to always treat every group, every person, as worthwhile. There was really no politics in the traditional polarizing way we think about it going on. It was always about how to keep people safe and how do we... Mostly, it was about how to tell them the information they need, that they can try to be safer and do the right things. It was really quite, I don't know, enervating to see that going on with it. I guess this relates back to my earlier point about the annotation narration is that we ended up moving from learning from the data, and then moving to this alternate approach that I think ended up being effective downstream. Lukas: Do you have a sense of how effective this was when you... I mean, it sounds like there's a whole bunch of different kinds of interventions you were doing. Is it possible to even know if you hadn't done these things, what would have happened? Do you feel anything about the overall effectiveness of this response? DJ: There's a blog post that Sam Shah wrote that talks about this, and there's been a whole lot of estimates. I think we'll continue to see estimates and a lot of people doing deep analysis for decades to come. I think I've received a fair amount of criticism, and it's okay to receive criticism about what people would describe as a very strong policy response, and that we were too aggressive in shutting down the economy, and taking the action we did. I actually sleep well at night, knowing that we took the strong action that we did, because if we didn't... I mean, I was in contact very regularly with friends who are on the front lines in New York City. I was on the calls with people who were in the ER who were showing me how they were... Just like in a kindergarten, they had a wall with paper brown bags that you put your mask in, because you just need to come back and reuse them. People forget how many physicians and nurses and janitorial staff, the people who we don't often think about in the health care system, that died in service. When you lose that capacity, you don't have the capacity to get back up. You don't have a... There's just no one else to take care. What you're seeing happen in Brazil, what you're seeing happen in India, that could easily been us. People think, ""Oh, we're good."" Remember, there was no Remdesivir. People didn't even know about ventilators and how... Do we flip a person over? Do we sit someone up? We had no information. We had a Slack channel that was created, literally, for physicians just to share information from one group to another about what they were learning. That's how little information we had. Now, what's behind this? A decade of under-investing in public health, more than a decade. Did we have to end up this way? Absolutely not. This is an abject failure of literally 20+ years of not investing. President Obama called for a massive revamp of this after Ebola. We saw this with MERS. We saw this with SARS. We've seen this many times over, and people often think like, ""Oh, we're through the pandemic."" This is not pandemic flu. This is not pandemic tuberculosis. This is not the next Coronavirus, which will show up. We will expect another Coronavirus. I don't say that to be a doomsayer. 
I say it as these are the systems that we need to get into place now to be ready for what's next so that we don't have to just say, ""One size fits all, shut everything down."" We can be smart, because we are going to have to create this as a knob, if you will, of dialing things open, dialing things back, depending on what we're seeing in which community. A lot of this is going to be really tough because it's socioeconomic also. As Roger pointed out, you get very different dynamics from one region to another. Lukas: As an outside observer, it hasn't felt like the levels of COVID the different regions saw was exactly correlated with the thoughtfulness of the COVID response. Do you think that's fair, or is it just that the data is noisy? Am I missing something there? DJ: Say more. I don't think I fully understand. Lukas: Well, I guess it seems to me... I did not prep by looking deeply at the data, but I've had this sense that some states that really aggressively put in controls, maybe, or the states that put aggressive controls in quickly, sometimes, they ended up having more COVID cases than states that seemed to ignore it. Some states, it sounds like California had a really thoughtful response. It seems like some even governors are like, ""Hey, this isn't even a problem,"" so I can't even imagine how that state can have any reasonable response, when the leadership doesn't even believe that there's an issue when there obviously is. I guess, it seems really hard to know how much the interventions really mattered. DJ: Right. I think there's a bunch of stuff. Then Roger, I'd love to... Let me just be real quick, and then you should go ahead. The first is, we're still scratching the surface in our understanding of COVID. I think a lot is still going to be learned. We now know as we were in the summer, we were really worried about protests that were happening. We were really worried at Sturgis Rally⁶. We thought, ""Oh my gosh, these are going to be super spreading events."" It turned out we dodged a bullet. I think people are like, ""Oh, you're wrong."" I could look at it as we dodged the bullet. Because if that was a highly contagious, more like measles, we'd be in real trouble. The other part I think, which is there, is one of the things that happened is because of the actions here, a lot of people did start to take COVID very seriously on a personal level. They go, ""Oh, this isn't just some... California is taking this action. Maybe we should take it seriously."" But the other example of this is the version that you're seeing in India, which is they stopped taking it seriously. They started whole big political rallies. They gave away their vaccines, and they're not shutting down still, and then people are partying and other aspects. That has led to the spike that is...it's decimating, because there's no path out now. What you see in the COVID numbers today is a reflection of four weeks ago. Roger, I'd love your... Roger: Just using the machine learning language, there's a lot of features that go into what goes on with this. The protests ended up not being super spreader, Sturgis was. Look at what happened in the Dakotas, right? Those are places without a lot of cross mixing of people because they're relatively isolated and remote, and they have the worst case loads in the whole world. It's not clear that it was totally Sturgis, but there's a lot of thought that it was Sturgis, and that their political response was not very strong. 
Now, those are states with less than a million people each, so the impact isn't as great, but there was a different response, and they had a much worse result. Now, what happened in Florida, I think, can be looked at differently. I'll just bring up... I know this isn't supposed to be a political discussion, per se, but I have my parents who live in Florida. People have self preservation, and I think that there was enough knowledge out there that, at some point, with or without these, whether the government was intervening or not, enough people were trying to play it safe, and we're doing the right things. Yes, there were people who were in that kind of obnoxious way about individual liberty and stuff like that. I always want to take one of those people and say, ""You need to talk to my friends from Taiwan,"" because anyone from Taiwan who went through, I guess, with SARS, they, as a collective group, knew how to act and do the right things. It's giving up some things that feel like a freedom, or whatever, seem worthwhile in the long run to tamp things down. Of course, Taiwan had a much better response than others. It's hard to take out the American context and how people behave. There's plenty of examples that would support one position or the other, but personally, I'm more comfortable with the intervention. I mean, I realize that it was really bad for the economy. I think things like maybe the way schools were opened and so forth could maybe have been handled differently. We've got a lot to learn. Lukas: Well, I really appreciate all the work that both of you did. That's actually a segue to a question I really wanted to also want to make sure I ask you, which is for someone just starting their career in data science, maybe most of the people in that situation that I've talked to these days, they really care about doing something meaningful, maybe getting involved in public sector stuff. What advice would you give someone maybe just graduating now who wants to do interesting work and have exciting careers like both of you have had? Where do you tell them to start, I guess? Roger: That's a good question. What I've been telling people is to remember this human side of things, and don't get too lost in the numbers. This is more like... This isn't quite career advice, I think, what you're asking, but also, you've gotten a bunch of tools that are pretty cool, but that doesn't mean they're applicable in every case. Always work your way up from, ""Is there a simple thing that will work? Well, how far does it get to you,"" and then work from there. Then when you need this sophisticated tool, and when it's worth it, to jump in with that. TensorFlow is not the answer to every classification problem. There's other tools that can work really well. But also, I mean, just find things. I think I just saw a tweet today, DJ, from Rick Klau that the state is hiring. If you are trying to do some good things...like I said, I was so impressed with the people of the state and their attitudes about trying to do the right things and being good for all Californians. That's a great place to start. DJ: The place maybe I would go with, because I agree with Roger on all of this, no surprise, and hopefully what people have taken away from listening is this is a team sport. The amount I've learned and grown from you, Lukas, from Roger, from the people that you've introduced me to, the people we've all hung out with, the thing that you want to do at the early part of your career is be around amazing, awesome people. 
Be around awesome. If you're around awesome people, you'll become awesome, too. You may feel like you're an imposter to start, and you gotta figure out how to shake off that imposter syndrome. But if you're around amazing people, they're going to carry you. It could be on a public sector. It could be in the private sector. It could be in a hybrid sector, but I fundamentally believe that if you're around those great people, that's what carries forward. I've been so fortunate early in my career being around amazing people in academia, then being around amazing people in public service the first time after 9/11, then coming out here to Silicon Valley meeting all of you, and being exposed to that group, and then going back into government back and forth several times. Each time, we're able to pull in amazing. The thing that people don't realize is...people ask me all the time like, ""Why do people pick up the phone when you call, and then why won't they pick up for someone other?"" It's just because I'm trying to do it for the team. It's a ""we"" approach. I'm not trying to just further it for one perspective. I think we've all had that philosophy, that this is a collective movement. I will go on the record saying this, which is I get way too much credit. The credit belongs to the community. It belongs to the teams, all these people. I've just had the good fortune of being in certain roles that get to shape certain things, but those people have also shaped me. They're the ones that have helped make me into what I am, and help make that happen. If you're early in your career, and you can find a place where you're learning at three to seven times the rate of somebody that's just in a regular job, you're going to do fine. Seek out those places. Don't optimize for a salary. I'm not saying it's not important, but optimize for learning. Your first derivative, your second derivative, should be highly positive on your learning experience quotients. Roger: I want to focus on a particular part of that, because I completely agree. This talk that I give is my general talk about data topics. It starts off with humility, and humility is a key to learning. I also will tell anyone that says, ""I need to hire a data scientist."" I said, ""You don't need to add one. You need to hire two."" No matter what you're doing, you need to be paired with other people. In terms of finding an opportunity, I think you gotta make sure you're not siloed. I want to give a particular example of what I think happens. When you get into the data, it's almost like a scientist looking at the universe. You say, ""The universe is my data,"" and without outside perspective, you don't learn. The data almost in a way stops your expansion, because that's all that you can see, but there's a lot beyond that. I know Ben Lorica, who is one of the people I was lucky enough to work with, who taught me so much, he's a real math PhD. I didn't have that background. We did not release anything without the other looking at it. We were on the phone almost every morning talking about what we were doing. When you're looking for these career opportunities, make sure that you're not going to be siloed, that you're going to have as much opportunity to work with other people, almost like in a peer programming way. Look for that. 
Look for companies where you're going to be able to talk to other people in the organization so that you're getting all these things that DJ was talking about: the opportunities to learn from amazing people, and just picking up little things like DJ's story about the data dictionary working so well — the next job you go to, and there's no data dictionary, you're going to make sure there's one there — picking up on those kinds of things, because they can be so effective. I think just making sure that you're like an octopus, and your career move is your tentacles are all over the place. Lukas: It's funny you both say that. I mean, I totally agree with it. I think it's one of the reasons that... This is totally shameless self promotion, but I really think it's true. We've really tried to build a friendly, smart, but really inclusive community at Weights & Biases with stuff like this, where people can meet smart people that they might not otherwise have access to based on luck and geography, and so I just really encourage people to engage with our community. If you're watching this, you're part of it. We love answering people's questions, and hearing from people, and hearing about what they're working on. Anyway, I just totally appreciate you guys coming on and talking and answering my open-ended questions, and also appreciate all the work that you've done throughout your career. It's been inspiring to watch, and clearly directly connected to a lot of good in the world. Thank you. DJ: Thanks. It's been fun to catch up. Lukas: At Weights & Biases, we make this podcast, Gradient Dissent, to learn about making machine learning work in the real world, but we also have a part to play here. We are building tools to help all the people that are on this podcast make their work better and make machine learning models actually run in production. If you're interested in joining us on this mission, we are hiring in Engineering, Sales, Growth, Product, and Customer Support, and you should go to wandb.me/hiring, and check out our job postings. 
We'd love to talk about working with you.",11307 +Amelia & Filip — How Pandora Deploys ML Models into Production,https://www.youtube.com/watch?v=cssfNH_2qrM,2449,2021-07-01,it's a balance that you have to find between having enough novel content and knowing which users like more novel content and which users prefer to hear the same old songs you're listening to gradient descent a show about machine learning in the real world and i'm your host lucas beewald phillip is a scientist at pandora working on recommender systems before that he was a phd student working on deep neural networks for acoustic and language modeling applied to musical audio recordings amelia naibaki is a software engineer at pandora where she runs a team responsible for the production system that serves models to listeners i'm super excited to talk to both of them hey maybe i could start by asking at pandora what do the machine learning models actually do and how would a pandora user experience the results of the models well we're a big company so like serious index and pandora we have a lot of different models that spread across all the you know product features almost every product in almost every digital product has some science background or some science model uh powering it so we have like internal for content understanding we have advertising models we have of course our main feature at panda our recommendations we have a lot of work done in recommendations and figuring out which songs to play next and of course like having so many models interact with each other makes it you know quite a complicated scenario so we need really powerful and strong engineering to make everything work smoothly and prevent outages and stuff like that so my work mainly focuses on the musical side so i work for basically the algorithmic programming team that focuses on creating the radio experience and i what i do is compute you know similarities between artists or tracks so given this track which tracks are similar and so on and like working at pandora is like the best place to do that because we have like this awesome data that nobody else really has so we employ a lot of music analysts who listen to who really listen to the songs and annotate them manually so this is like the dream of every phd student in this field like oh wow i've actually experts looking at the data and annotating them and also like our users provide us feedback and over time that we collected over a hundred billion thumbs up or thumbs down like ratings if users like the song or not so we have very detailed and strong features on one side and very nice explicit feedback by users on the other side this is like the perfect scenario to create very very powerful models do you have you have a model that's sort of explicitly trying to understand the similarities between songs yeah exactly that's like one of my uh projects i worked on last year basically like this model connects like we had that all from the beginning like a model that tries to understand which is which songs are similar but of course now with the deep learning revolution we tried to replace the old models with more sophisticated neural networks that try to take the features we get from the music analysts and map them to what people think like what people think about similarity because it's not obvious like which musical features actually make a song sound similar right is it the tempo is it the mood and by using machine learning we're able to stop thinking about that and just make the computer do this 
this work for us i guess as a fan of music myself i really wonder what the similarity really means right like you probably long ago when you were famous for having all these annotators you probably had to really think deeply about what similarity really means like do you have a definition in your head of what makes two songs similar or not similar i don't know it's it's it's very hard to pinpoint that right when pandora started i don't know like maybe a million better or when it started it's like 2000 and something something like that's when i first started to yeah so back then what people would do is just compute like distances manually they would take like the features and weight them and like manually create like an algorithm that tells us what's similar and what not and they would look how like the lists look like and this is like bootstrapping from nothing right but gradually of course we collected more and more feedback we can replace that just by using models and we don't have to think about that at heart so we have an idea how this works which i obviously share but like it turns out that if you run a model on that it's it figures out a little bit better than humans could do before i guess i would imagine though like one definition of similarity would just be like these two songs like people that like one song like the other song is that how you think of similarity or is it somehow deeper than that like these two songs are fundamentally similar songs like i would think like a person who likes kind of one recent top 40 song might like another recent top 40 song but they might be totally different genres so then does your model try to say these two songs are similar or not similar or what does it do in that case like we are more focused on like this radio experience right so you have like you select like an artist for example or a track to start a radio from and this is like maybe a more direct or specific way to define similarities basically similarities what kind of songs i can play on that radio so if a person likes like top like charts music they won't want to listen to i don't know some hip-hop song on there i don't know dance radio right so if we don't have to in this case we don't have to model like the taste of the user although we do that of course for other other things but in terms of music similarity we really think like okay how which songs can i can we play on this radio station so i would think that your model would need to look at both the sort of like musical elements of the song and kind of other things right because our culture kind of affects our sense of similarity like what what other things does your model look at besides just the audio of the song well it's just as emilia said we have different models for different aspects of this of this whole musical experience right we have models that are just based on the musical features we have models that are just based on the audio and this is like special because when you do recommendations everybody no matter if it's like netflix or or pandora or whatever you have like this long tail of unknown items that nobody really like few people have listened to them so we can't really understand from user interactions who would like them so this is like the way we deal with that we go from the content from like audio or for musical features but for items that are for songs where we already gathered a lot of feedback it's easy for us to just do the classic thing oh somebody like this and this song they're similar to you so 
maybe you also like them so depending like if the song is very popular or not we can then different recommenders work better or worse got it and so what happens when the model improves how would i as like a a user of your product experience a better model would i notice like when you put out a new version that does better recommendations well we know this right because we just we have a very powerful like ap testing framework that media works a lot with right and when we create a new model and we add it to our ensemble of recommenders or we improve one of the models we just deployed very quickly in an a b test and after a few hours we already get like results so we see like oh people thumb up more or people spend more time listening and media worked a lot of with this a b test right so you know a lot about that yeah i would expect that you personally would not notice other than oh pandora's been really getting it right today [Laughter] yeah but we see things like listeners are thumbing more in one or another direction they're spending more time on pandora they're creating more stations we have a bunch of different things we can look at to see how we're affecting listener behavior okay so maybe let's get a little more technical for our audience i do have a zillion questions as a pandora user but the point of this is supposed to be around how you actually make these models so do you actually like chain these models together it sounds like you take the output of a lot of these different models and then use all those outputs to make decisions in your application yeah absolutely everyone kind of talks about this scenario where like one model changes and has unintended consequences like how do you deal with that like all the models connected together that's a good question one of the ways that we use models is during our song recommendation pipeline the ensemble recommender system proposes a set of candidate songs and passes them to a microservice that handles the real-time evaluation of a machine learning model and that machine learning model is like our larger overarching model that figures out how the other models are informing the decision did that to that let me see if i can repeat it back to you and tell me if i got this right so it sounds like you have an ensemble model or kind of several models that take into account different things like maybe the actual audio quality of the song and you mentioned sort of like non-audio features of the the song and it proposes several songs that you might play next and then you have another model that runs like a micro service that looks at those options and maybe takes into account more things and decides the actual specific song that gets played yeah that's exactly right and some of the features coming into the that final model are from the previous models the the models from the ensemble recommender do you have to retrain the microservice every time you deploy a new model upstream of it i can tell you that the model that the microservice uses is retrained every day so wow with fresh data and we have validation that runs to make sure that our results aren't totally wacky before we actually upload it to the microservice yeah and we have like daily reports that show us maybe like like feature importances and the average value so we can keep an eye on how the model is changing day to day and the nice thing about that is for example when i deployed my my recommendation system last year it's like addictive because you look at the numbers like every every 
day and like oh yeah it got that many like i i recommended that many songs and people liked it that much it's really nice how easy that works you just add a new model and after like you wait a bit and you like the micro servers post them in and selects them and it's it's really cool and i guess what do you use to keep track of all the the versions this is something a lot of our users are asking us all the time like how do you version your models and version the data that the models are trained on how do you think about that yeah that's everybody asks that because it's a very hard problem right right so if you could walk me through as much detail as you can how you do that because i'm sure a lot of people are wondering yeah for code it's pretty easy right everybody uses git and we use git for basically everything and we have like our own you know instance of a product that we use and all the other code that trains the models and all the code that runs in production is on the server model versions training tracking model versions is way more difficult especially during during development right because you run a lot of experiments you try to compare them what we did until recently we're just like everybody wrote their own libraries that you stored the config somewhere computed hashes and you no so you can track back if you want to find something but that's a pain that was really a pain i would have like a an experiments directory where i had like 200 sub directories with different experiments and i would have another like google sheet somewhere that stores the names of the important experiments so i can know which models to use when i want to deploy them and yeah since we use now since we use waze and biases this got way way easier because we just like lock our experiments we can filter them easily we can compare them very easily it's like we store for example the learned weights there and just pull them when we need to and we decide okay this is the model we want to go with we just download the model and that's it so it's like all the keeping track of models during development is gotten much easier through that i'm so glad to hear that i appreciate the the weights and vices shout out um but not trying to make this only vote awaits advices i guess you know another question we get all the time is like what are the other tools that you use day-to-day to make your life easier as a machine learning practitioner so i think this is like more about like the development part like where we create the model we we train and and like we look at the data and try to figure out what to do almost everybody i work with we use intellij for development because it's just like this one ige that that rules them all it has all the languages like we mostly work with python for the experiments and then once we're done we either use you know python with pi spark or or scala with spark to deploy the the code in production and like with intellij all of this gets so much easier because it speaks all these languages it has very nice plugins to connect to google cloud this is like the service we're using for almost anything now we switched like a few months ago and then also made our lives much easier there's plugins there where you can connect to to a data proc cluster and inspect all the the database schemas and tables and you get column completion when you write your sql statements that just like that so incredible the first time i saw that i was like wow this changes everything so yeah mostly intellij also a very nice 
thing is like the remote debugging feature so you don't have to log in with ssh in your to your training server and try to debug in the command line you just have like a visual debugger you can inspect the variables in there and run the code remotely still so it's pretty strong too for me and makes my life much easier can you talk a little bit about how you debug models this is another question everyone has can you walk me through your process a little bit when when that goes wrong and actually the performance goes down what do you do well then it's just like going through the code and tracing back what you changed and what might have caused the problem but i think it's more important to never come to this point and you do that by well i try to do that by being a slow starter so i don't try to write the most complex model like right from the start but start slowly first get like some of the data and then i try to make sure the data makes sense so i try to select the small model try to overfit the model on the small data set is it possible i change like for example i randomize the features to make sure that there is for example no problem with with trained test splits or if the model you know actually produces garbage when i put garbage in because that should happen right there's a very nice blog post by andre karpathy like on that topic it's called a recipe for trading neural networks and like he's pretty good at what he's doing so i'm just trying to follow this recipe as good as i can like making sure you understand the data making actually sure that you don't evaluate something that you don't care about so you should make sure that the numbers you get actually reflect what you want to see in the end and yeah that's basically it just being very defensive with your development and checking things again and again it's difficult to debug models because there is no right way and if you make a bug in your neural network training code it will still mostly work it's not like it will crash and burn it will work but works right do you have any bugs that come to mind as as particularly difficult or ones that you've struggled with for a long time i was trading an embedding network that uses the triplet loss and you have to select like positives and negatives right and my the data was stored as a sparse matrix so you had like a matrix which items are connected to which is a ground truth so it's very easy to understand which positives to sample for a given item because there is a one in the the matrix wait so i'm not an expert in the space what what is a one mean here that they're connected say say for example you have tracks and which ones are similar for example but the problem was that when you mask that because when you do a trained testable you have to mask that that matrix so you don't use any data from the test set in your training set but the problem then is that you don't know whether a zero like a no entry means that it's masked because it's in a different split or whether there is actually no connections what ended up happening is that i didn't sample all the negatives that were possible and this of course makes your training harder because you don't get you don't you're not using only all the data yeah and finding that out was pretty like pretty tough because it still works right all right so maybe there's a question for both of you actually but this is a question that comes up a lot that people always want me to ask is how do you communicate the progress you're making with the 
non-technical people outside of your team but in the company at least for the system that i'm working on we have weekly meetings with our pm who communicates up the ladder we occasionally end of year maybe end of quarter will present to the broader product organization what changes we've been making and how they've been affecting our core metrics i think sometimes people tell me that they have this experience where like other teams are kind of working on engineering projects which sort of like add these features that are very visible but the stuff that both of you are working on can feel more like experimental and there can be like long periods of time where the experiments aren't working and that can be frustrating is that consistent with your experience or not well luckily our direct managers at least mike is i think in the science department every manager used to be a scientist in his previous life so so they know how science works and you can make a lot of progress in a few weeks and you can be stuck for a month and just iterate and experiment and nothing really works the good thing is because of all this you know infrastructure we have the micro service and media was talking about and we can actually trace back every thumb or every song of somebody like to all the individual models so in the end we after like say every quarter of a year we can actually put a number on how many more thumbs we get because of this contribution and how many more time how much more time people actually spend up listening to pandora and since pandora is a ad based like a service that translates very well to money sure that makes sense okay another question for both of you is how do you think about tuning and improving your models like do you do that kind of hyper parameter search that a lot of people talk about or is it more intuitive or is it more structured yeah i think like when it comes to hyperparameter research it's like a hybrid right because we deal with similar problems all the time oftentimes we already have like a good guess what kind of how the model should look like to get reasonable results and this is most of the time this is just like where i start i will just try like five different configurations to see how big or how small i can go and then just settle with a model that works well enough and that's it and then i just like keep it rating on different things like okay which kind of other features can i use can i pull other data and integrate it somehow and once all of this is done once i'm like quite confident okay this is like the the structure the model structure that we'll probably work with these days with weights and biases we just just like create this hyperparameter sweep you don't have to change anything in your code you start it on the friday it runs for for over the weekend or longer depending on the size of the model and then you're done so it's it just saves a lot of headache if you can run this automatically and without much thinking and so you spend most of your time it sounds like thinking about more data you could get or different features you could try than the hyper parameters most of the time yeah honestly because that's where this is like a difference between like working at academia when i was doing my phd and actually working in industry because in academia you have like okay this is the data set that's the standard in this field you take it and you try to like improve you try to create a new method or whatever but at an industry like at pandora it's like okay we want 
Lukas: And so you spend most of your time, it sounds like, thinking about more data you could get, or different features you could try, than the hyperparameters? Philip: Most of the time, yeah, honestly, because that's where there's a difference between working in academia, when I was doing my PhD, and actually working in industry. In academia it's, okay, this is the data set that's the standard in this field, you take it and you try to improve on it, you try to create a new method or whatever. But in industry, at Pandora, it's, okay, we want to solve that problem, here is all the data we have, solve it. So you have to think about which data makes sense, what kind of data makes sense to use, which biases that would induce when I use that data. So thinking about which data to use, how to clean it up, because that's a big problem: if you work with real data, you have outliers, you have some problems here and there. I've spent a lot more time thinking about data since I started working in industry than before, definitely. Lukas: All right, well, maybe let's talk a little more about production. I mean, you started to talk about the microservices, but I'd love to hear more about how you actually serve the models in production. Emily: Yeah, we're generating our production models in GCP and then we upload them to Redis, which is a key-value store, and that's where the microservice can read them. And then, to avoid having to go to Redis every time we need a model, we stash them in a Guava cache on-heap, because, at least for the models that I work with, we're using them every time there's a request for more songs on a listener station. Lukas: So, are these deep learning models, or simpler? Philip: I think the model you're talking about, in the microservice, is not, just because it has to serve a lot of requests in real time, so you just can't afford to run a complicated deep learning model at this point. The recommenders in the ensemble, there are a few deep learning models there, of course, but for the final selection of the track, I think the models provide enough features and enough candidates to just have a simpler model like that. Lukas: Ah, cool. I see, so there are sort of bigger models that run in batch mode, where they don't have to be real time, is that right? And then the final model you talked about has to run in real time, so it's lighter weight. Emily: Yeah, yeah. And we've definitely had the experience of trying to increase the size of the model and having to pull that back because it wasn't performant enough. Definitely, performance is something that we're always keeping in mind; we don't want the user to wait around. Lukas: So what are the hard parts about getting that model into production? What are the day-to-day challenges? Emily: Yeah, efficiency is the biggest thing that I worry about, that latency. Again, we're always trying out new changes, things like Philip mentioned, like adding new features. Sometimes we'll try partitioning listeners in a different way, so sending different listeners to different models, or changing the size of the model. And sometimes those changes will look really promising offline, but then we try them in production and we'll see that they're too costly computationally. We're getting hundreds of thousands of requests every minute; we've got to be super fast. Lukas: I'm curious, have you always been deploying a new model every day, or is that a new process for you? Emily: For the last couple of years we've been doing that every day. Occasionally we'll skip days if we don't pass validation that day, and then somebody will go and look and make sure that it's reasonable, or see if we need to make changes. But yeah, we're staying pretty up to date. Lukas: And what causes that? Do the songs change every day, or do the songs people like change every day? Emily: Songs can change every day, yeah. I think mostly we just want to make sure we have the latest data: the latest thumbs, the latest completion rates, the latest way that listeners are reacting to songs.
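The serving path Emily describes is a two-tier lookup: models published to Redis, with an on-heap Guava cache in front of it so the microservice rarely has to leave the process. The real service runs on the JVM; purely as an illustration of the same pattern, here is a minimal Python sketch with hypothetical key names and expiry times.

```python
import pickle
import time

import redis  # assumes the redis-py client is available

class ModelCache:
    """In-process cache in front of Redis, so most lookups never hit the network."""

    def __init__(self, redis_client, ttl_seconds=3600):
        self.redis = redis_client
        self.ttl = ttl_seconds
        self._local = {}  # model_key -> (expiry_timestamp, deserialized_model)

    def get(self, model_key):
        entry = self._local.get(model_key)
        if entry and entry[0] > time.time():
            return entry[1]  # served from the in-process cache
        raw = self.redis.get(model_key)  # fall back to the shared store
        if raw is None:
            raise KeyError(model_key)
        model = pickle.loads(raw)  # assumes the training job stored a pickled model
        self._local[model_key] = (time.time() + self.ttl, model)
        return model

# Hypothetical usage inside a request handler:
# cache = ModelCache(redis.Redis(host="localhost", port=6379))
# model = cache.get("station-ranker:latest")
```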
Lukas: I see. So I guess you sort of don't have to worry as much about data drift, because you're retraining every day on the latest data. Emily: Yeah, I suppose that's less of an issue for the particular model that I'm working with. Lukas: Do you have production monitoring in place for that model? Do you look for signals that bad things might be happening? Emily: We certainly have dashboards that monitor things like number of requests, and latencies, and CPU thread counts, things like that. But mainly the way that we monitor things is those A/B tests, where we're pretty confident that our control model is pretty darn good, and any changes that we're making, we're comparing against the control model. Lukas: I see. How many A/B tests, order of magnitude, can you run in parallel? I'm jealous of how many users you have; it must be amazing to get that data. Emily: Yeah, our particular group is running tens, probably, maybe hundreds if you look at our broad product area, and then I think thousands if you're looking at the whole company. Lukas: Wow. Is it tricky to swap models in and out in production, or is that simple for you? Emily: It's simple mechanically, we can just overwrite the value in the cache, but in practice we're a lot more careful. We always run an A/B test; we never swap anything in without making sure that it's moving metrics in the right way and not degrading the experience for the users. But yeah, mechanically it's really simple. Lukas: Can you talk about the steps you go through to take a model from experimental to being the model that's blessed, the one that runs by default in production? Philip: I can speak to the recommendation models that we use for stations, and that's actually also not that complicated, because in the end what you do is experiment with the model, and then at some point you think, okay, this is the model I want to use. So then what I would do is just translate that model into an Airflow DAG, where it can run weekly or daily or however often I think is necessary. In the easiest case, I would just produce a table on GCP with recommendations, and this table then gets pulled in: I'll just ping an engineer and say, hey, there is a new model around, look at this table, and they will pull it into this ensemble where all the candidates are being pulled together, and for a certain number of users the microservice will then pick songs by that recommender. In the beginning it's just a very small percentage, of course; we don't throw this new model at all the users, because we don't know how it behaves, right? So we would say, try it on one percent of the users and observe the numbers. Do they like the new recommendations? Do they thumb up songs recommended by this recommender? And also, does it make sense to add this recommender to the ensemble at all? Because maybe it recommends awesome songs, but it doesn't add anything to the mix. And that's basically it, and then it's in production.
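A rough illustration of the promotion path Philip describes: a candidate recommender wrapped in a scheduled Airflow DAG that writes a recommendations table for the serving ensemble to pull. It assumes Airflow 2.x; the DAG ID, schedule, and task body are hypothetical, not Pandora's actual pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_recommendations(**context):
    # Hypothetical steps:
    # 1. pull the latest listener feedback (thumbs, completion rates) from the warehouse
    # 2. score candidate tracks with the promoted model
    # 3. write a recommendations table that the serving ensemble can pull in
    ...

with DAG(
    dag_id="station_recommendations",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # or "@weekly", however often it's needed
    catchup=False,
) as dag:
    PythonOperator(
        task_id="build_recommendations",
        python_callable=build_recommendations,
    )
```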
Lukas: Okay, we always end with these questions; I'd love for both of you to weigh in if you feel comfortable. The first one is, what's an underrated aspect of machine learning that you think people should pay more attention to than they do? Philip: Well, the thing that I always think of is maybe not directly, technically related to machine learning, but in general it's ethics, and diversity and equality. That's a topic that comes up sometimes now in machine learning, and it's getting more prominent, but I still don't think it's enough, because we are just creating all these models that do very seemingly smart stuff, but few people actually look at, okay, what are the consequences of these things? And even some figureheads from academia and industry are saying, oh, the models are not biased, we don't really have to care that much about that, because the models are just neutral. And it's kind of right, the model has no bias, but it learns the bias from the training data, and the training data we use is stuff that's happening right now, or that happened over the last 10 or 20 years. So what the model learns is to reproduce that bias, and I don't see a way to really tackle that from a data perspective. Just take the new GPT-3 model, the language model developed by OpenAI: it was trained on 410 billion tokens. How do you change the training data in a way that it doesn't produce, I don't know, gender bias or racial bias? It's impossible, I think. I think we have to think very carefully about how we use these models, and how we can integrate some way of human decision-making into the whole process, and not just blindly trust whatever the model says. Lukas: Can I ask, does that come up at Pandora? I feel like you have, in some ways, this really wonderful, kind of fun application of machine learning, and maybe you're one of the few places where there might be less ethical concern, I can imagine some, but do you think about it day to day? Philip: Well, the reason why I got into music and machine learning is that very reason, because, I know, it's hard to do bad things in music. But actually we do have some discussions about that. Let me give you two examples. One is what we call popularity bias. It's known that basically all recommendation models suffer from popularity bias, meaning that they recommend popular items more often, and most of them actually recommend popular items more often than their popularity would suggest, so they even strengthen it, they reinforce the whole thing. Lukas: Right, because it's like a safe choice, maybe. Philip: Exactly. It's actually quite hard to beat a recommender that just plays the most popular songs, right, looking just at the numbers. So we have some functionality included, maybe not in the individual models, but at the end we try to diversify artists. We try to boost artists that are not very popular, because it basically helps everybody: it helps the user to find new artists, and it helps the artists to get more exposure that they wouldn't get otherwise. So I think it's a good thing to do. Plus, some of the recommenders, as I said, are looking purely at the musical information: how does the song sound, or what characteristics did the analysts annotate? This is just a way to try songs that don't have much feedback data and are hard to recommend otherwise. And another thing that we recently started discussing, and we intend to explore further, is that we found that for some genre stations we have a very imbalanced distribution between male and female artists. And of course, nobody at Pandora decided to, I don't know, make the country radio only play male artists, but this is what just happened, because we look at what people listen to, and we will always take care that every listener gets what they want to listen to. So if somebody just likes hearing male voices, they will just get male artists, and if somebody just likes female voices, they will get female artists. But we're discussing how we can create a better balance, like pushing new female artists more in the scenarios where we have a strong imbalance.
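One common way to implement the kind of diversification Philip mentions, boosting under-exposed artists at the final ranking stage rather than inside each individual model, is a small score adjustment at re-rank time. The sketch below is a generic illustration with made-up field names, not Pandora's ensemble logic.

```python
import math

def rerank(candidates, popularity_weight=0.3):
    """Re-rank candidate tracks, discounting raw popularity so that
    less-exposed artists have a chance to surface.

    Each candidate is a dict with hypothetical fields:
      'score':             the ensemble's relevance score
      'artist_play_share': the artist's share of recent plays, in [0, 1]
    """
    def adjusted(c):
        # Penalize tracks in proportion to how over-exposed the artist already is.
        penalty = popularity_weight * math.log1p(100 * c["artist_play_share"])
        return c["score"] - penalty

    return sorted(candidates, key=adjusted, reverse=True)

# Hypothetical usage:
# playlist = rerank(candidates)[:20]
```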
Lukas: Well, it's really cool. Emily, do you have any thoughts on that topic? Emily: I really appreciate that Philip is thinking about it. I have noticed that the music that I tend to like is by male artists, but I, as a woman, would like to support female artists, and I would like to be able to find female artists that I enjoy, and I would like to see that promotion of female artists happen in Pandora. We do things in the product that try to offset some of that imbalance. For instance, during Women's History Month last year, we created personalized playlists for our premium users that were only female artists. My playlist was very good. Lukas: Nice. You should share it in the show notes. [Laughter] Emily: And we do things too, like for Black History Month we had some personalized, I think, Pandora Stories that we shared out. So we're definitely trying to make a small bit of difference in that bias. Lukas: This is a pretty broad question, but do you have any thoughts on, I feel like sometimes these recommendation systems, and machine learning in general, get a knock for sort of optimizing for our reptile brains. I could see that with Pandora: maybe wanting, in the short term, to hear the same songs over and over, but in a more, I don't know, higher-brain sense, wanting to be exposed to new music. Do either of you think about that day to day? Do you feel like it's possible to over-optimize for a thumbs up or for listening time? Philip: Definitely, yeah. But it's something that we always have in mind. Of course, the direct metrics are time spent listening and so on, but we definitely hear users saying, okay, there is just too much repetition. This is something that's very hard to measure in a very direct way, and what Emily said is that if you just blindly reduce repetition, because it's an easy thing to do, it tends to annoy some people. It's a balance that you have to find, between having enough novel content and knowing which users like more novel content and which users prefer to hear the same old songs all the time. So it's definitely something that we have to keep in mind, and we do. Emily: Yeah, one of the things that we're doing in the product too, related to that specific question, is the modes. So if you're on an artist station and you're getting tired of your normal station experience and you really want to get some new stuff in there, you can go into discovery mode and you'll get some really fresh songs. But then when you get tired of hearing new stuff, because that's sort of exhausting, constantly having new content thrown at you, you can go back to your old experience.
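The balance Philip and Emily describe, enough novelty to keep discovery alive without annoying listeners who prefer familiar songs, often comes down to a per-listener mixing rate that a mode like discovery can override. A toy sketch of that idea follows, with purely illustrative names and no relation to Pandora's actual logic.

```python
import random

def mix_playlist(familiar, novel, novelty_ratio, seed=None):
    """Blend familiar and novel tracks for one listener.

    novelty_ratio is a per-listener value in [0, 1]; a 'discovery mode'
    toggle could simply override it with something close to 1.0.
    """
    rng = random.Random(seed)
    familiar, novel = list(familiar), list(novel)
    playlist = []
    while familiar or novel:
        pick_novel = bool(novel) and (not familiar or rng.random() < novelty_ratio)
        playlist.append(novel.pop(0) if pick_novel else familiar.pop(0))
    return playlist

# Hypothetical usage: a listener who tolerates roughly 20% new material.
# queue = mix_playlist(familiar_tracks, fresh_tracks, novelty_ratio=0.2)
```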
Lukas: Awesome. Well, the final question, and we're running out of time but I want to make sure I ask it, is: what's the biggest challenge of actually getting machine learning models deployed in the real world, from the beginning of the conception of it to actually being in people's hands, giving them better music? Where are the surprising bottlenecks? Philip: Well, I think we talked about that a little bit already. For me, coming from academia, it was first of all that you approach the problem from a different point of view. Because before, you just have the data set and you try to improve the model, even if it's just by one percentage point of accuracy. And now it's more like you have a problem, and the first step is to find the data that solves that problem. So you have this huge data store, and, okay, how can I find the data that solves the problem? Then you develop a model, which is pretty similar, and then at some point you have to ask yourself, when is the model good enough? Because you can always keep on tuning; this is science, right, so you can just keep on improving forever. And the difference here is, of course, that one or two percentage points of improvement in academia gets you a new paper; in industry it might not even matter, because the impact on the end user is so small, since you have a hundred other recommenders in the ensemble. So for me the hardest part was to just let it go at some point and say, okay, this is it, that's enough. Emily: I totally already mentioned this, but the biggest challenge is always making sure your machine learning model is performant enough to make predictions in real time. I think during the research phase of development you can focus on the accuracy of predictions without worrying a ton about the latency of the predictions, but in production the prediction latency has to be low enough that a user isn't waiting around for results. So there's definitely a balance there between the effectiveness of a model and the efficiency of a model. Lukas: I mean, spoken like someone who really has models in production. It's so great to talk to both of you. I really appreciate it, that was super fun. And I feel so proud that we could help you guys. At Weights & Biases, we make this podcast, Gradient Dissent, to learn about making machine learning work in the real world, but we also have a part to play here: we are building tools to help all the people that are on this podcast make their work better and make machine learning models actually run in production. And if you're interested in joining us on this mission, we are hiring in engineering, sales, growth, product, and customer support, and you should go to wandb.me/hiring and check out our job postings. We'd love to talk about working with us.",7006 +Luis Ceze — Accelerating Machine Learning Systems,https://www.youtube.com/watch?v=K2_FbDsB3j4,2908,2021-06-24,"Luis: I've never seen computer systems architecture and systems optimization being as interesting as it is right now. Because there was a period when researching this was just about making microprocessors faster, making a little bit better compilers, but now that we have to specialize, and there's this really exciting application space with machine learning that offers so many opportunities for optimizations, and you have things like FPGAs, and it's getting easier to design chips, we create all sorts of opportunities for academic research and also for industry innovation. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Luis Ceze is co-founder and CEO of OctoML, founder of the Apache TVM Project, and a Professor of Computer Science at the University of Washington. He's an expert in making machine learning run efficiently on a variety of hardware systems, something that I'm super fascinated by and don't know a lot about. So I could not be more excited to talk to him today. Why don't we just kind of jump right in, I guess. You're the CEO of OctoML, right? And that's based on the Apache TVM Project that I think you also authored. Can you just kind of, for people who don't know, kind of give a description of what that is? Luis: Yeah, sure. And maybe a quick intro.
I wear two hats: I'm CEO of OctoML, and also a Professor of Computer Science and Engineering at the University of Washington. I have many machine learning friends, but my own area is machine learning systems, so what does that mean? It means building computer systems that make machine learning applications run fast and efficiently, and do what they're supposed to do in the easiest way possible. And often we use machine learning in making machine learning systems better, which is something that we should touch on at some point, it's an interesting topic. Apache TVM...TVM stands for Tensor Virtual Machine. It started in our research group at the University of Washington, about five years or so ago. And the context there was the following. Five years ago, which in machine learning time is just like eons ago, there was already a growing set of machine learning models that people care about, and a faster and faster growing set of those. The fragmentation in the software ecosystem was just starting: TensorFlow and PyTorch, MXNet, Keras, and so on. And the hardware targets at that time were mainly CPUs and the beginning of GPUs, and a little bit of accelerators back then. But our observation then was that, while we have a growing set of models, a growing set of hardware targets, and then this fragmentation, either you have a software stack that is specific to the hardware that you want to deploy your model to, or they're specific to use cases like computer vision, or NLP, and so on. We wanted to create a clean abstraction that would free data scientists, or machine learning engineers, from having to worry about how to get their models deployed. We wanted to have them focus on the statistical properties of the model, and then target a clean, single pane of glass: a clean abstraction across all of the systems and hardware, such that you can deploy your model and make the most of the hardware targets...make as much of the hardware target as possible, too. As you all know here, since there are a lot of machine learning practitioners that listen to this, machine learning code is extremely sensitive to performance. It uses a lot of memory, uses a lot of memory bandwidth, which means that you use a lot of the ability to move data from memory to your compute engine and back, and it also uses a lot of raw compute power. That's why, you know, hardware that is good for machine learning today looks more and more like the supercomputers of not too long ago: vector processing, and matrix and tensor cores and all of these things, a lot of linear algebra. Making the most out of that is really, really hard. I mean, code optimization is already hard. Now, if you're optimizing code for something that's as performance-sensitive as machine learning, you're talking about a really hard job. So anyways, I'm getting there, I know it's a long story, but hopefully it'll be worth it. So at TVM, what started as a research question was: can we automate the process of tuning your machine learning model and the actual code to the hardware targets that you want to deploy to? Instead of having to rely on hand-tuned libraries or relying on a lot of artisan coding to get your model to run fast enough, we wanted to use machine learning to automate that process. And the way that works is TVM runs a bunch of little experiments to build, really, a profile or personality of how your hardware behaves, and uses that to guide a very large optimization space to tune your model and your code.
So the end result, from a user point of view, is that you give all of this as input to TVM, you choose a hardware target, and then what TVM does is find just the right way of tuning your model and compiling it to a very efficient binary on your hardware target. Lukas: And I guess when I think of like- Luis: Does that answer your question of what TVM is? I know it's long, but I hope it was useful. Lukas: Yeah, no, it's great. I want to ask some more clarifying questions, I guess. I'm not a hardware expert at all, and I guess what I've observed trying to make ML models run on various hardware types is that it seems like it's harder and harder to abstract away the hardware. It seems like people are really kind of building models with specific hardware in mind, sometimes specific memory sizes, and things like that. And I guess my first question- Luis: And that's what we want to change. We want to remove that worry from the model builders. We want them to focus on building the best statistical properties possible, and then everything else should be left to engines like TVM and the Octomizer, which I can tell you more about later. Lukas: And so this TVM though, is it actually like a virtual machine? Is it doing kind of real-time compiling to the hardware as the model runs? Luis: That's part of the work, yeah. So TVM, by and large, we call it just-in-time compilation. The reason just-in-time compilation is important is because, well, you learn more about the model as you run it, as you evaluate it. And then second, you can do measurements of performance and make decisions about how you're going to tune the rest of your compilation. So, it is a virtual machine in the sense that it offers a clean abstraction. It's not a virtual machine in the VMware sense. It's more like a virtual machine in the Java virtual machine sense. Which could be a whole different conversation. It's even closer to my world as a computer systems architect, thinking about those kinds of abstractions. But TVM is a virtual machine in the sense that it exposes a well-defined interface for you to express what your model does, and gets that lowered down to the hardware target. Lukas: Got it. And is this typically for deployment, or could it also apply at training time? Luis: Great question. So TVM so far, by and large, its use has been for inference. So you have a model that's been trained. You've often done quantization by then, too, and so on. And then you run it through TVM because... We see that as a strength: you apply all the optimizations that could change the statistical properties of your model, and you validate your model that way. And then whatever we do from there on should be seen as a process that preserves exactly what your model does. We don't want to change anything, because we see that as complementary to all of the optimization that model builders would apply before then. So then, once again, this is really like a compiler. It's a compiler plus code generator plus a runtime system, and we specialize everything to your model and the hardware target. We really produce a custom package that is ready to be deployed, that has custom everything. It has a custom set of operators for your model, a custom runtime system for your model, and then wraps it all up into a package that you can just go and deploy.
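As a rough sketch of what that "model in, deployable package out" flow looks like with Apache TVM's Python API (assuming a recent TVM release, a hypothetical model.onnx file, and a made-up input name and shape):

```python
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import a trained model (ONNX here; TVM also has TensorFlow/PyTorch frontends).
onnx_model = onnx.load("model.onnx")  # hypothetical file
mod, params = relay.frontend.from_onnx(onnx_model, shape={"input": (1, 3, 224, 224)})

# Pick a hardware target: "llvm" for CPU, "cuda" for NVIDIA GPUs, cross-compile
# triples for Raspberry Pi-class devices, and so on.
target = "llvm"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# The result is a deployable module with its own compiled operators and runtime.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
# module.set_input("input", data); module.run(); out = module.get_output(0)
```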
Lukas: Got it. And are you picturing, typically, is this kind of for edge and low-power compute environments? Or is this more for servers? Luis: Yeah. Great question. So, remember that I was telling you about automating the process, and using machine learning to discover what the hardware can and can't do well, and using that to guide your optimization? That frees us from having to make that choice, because essentially, as long as there's no magic involved (obviously, if you have a giant GPT-3-like model you want to run on a one-milliwatt microcontroller, that is just simply not going to work, that's obvious), in terms of the actual basic flow of having what we call cost models for the hardware target, and using those predictive models to guide how to optimize the model for that specific hardware target, it's essentially the same from teeny microcontrollers all the way to giant, beefy GPUs or accelerators, or FPGA-based stuff, which we support as well. That means that TVM doesn't have a preference either way. So we've had use cases both in the open source community and in the research space as well, which we support and we still do ourselves, all the way to our current customers at OctoML; we have customers for both edge deployment and cloud deployment, because the basic technology is effectively the same. Some of the actual deployment aspects and the plumbing change a bit. If you're going to deploy on a tiny device, you might not even have an operating system, for example. So we support some of that. That's different than a server deployment, but the core aspect of how to make your model run fast on hardware targets is essentially the same. Lukas: I see. I guess for kind of server-level deployments, I feel like with the exception of TPUs and a few companies, it seems like almost everyone deploys onto, like, NVIDIA stuff. Is this sort of outside of CUDA and cuDNN, or does it translate into something that can then be compiled by CUDA? How does that work? Luis: Yeah, this is an excellent question. So first let's think about just a world with NVIDIA, and then let's just free ourselves from that tyranny, which actually is part of the goal here too. No, I love NVIDIA, I have many friends there, I admire what they do, but people should have a choice. And there's a lot of really good non-NVIDIA hardware. NVIDIA makes great hardware, but there's a lot of really great non-NVIDIA hardware here. Let's start with NVIDIA. Let's imagine a world where all you care about is deploying on NVIDIA. So NVIDIA, at the very lowest level of their compilation stack, does not expose what we call their instruction set. That's actually kept secret; they don't expose it. You have to program using CUDA, that's the lowest level. And there's cuDNN on top, and also, parallel to that, you have TensorRT, for example, which is more of a compiler where you compile a model to the hardware target. TVM can sit parallel to those, but at the same time use them. So here's what I mean. Both cuDNN and TensorRT are generally guided and tuned and improved based on models that people care about, and move with where the models are going. There's some fair amount of tuning that moves with where the models go. Whereas TVM, again, generates fresh code for every fresh model. So that means that in some cases we do better than TensorRT and cuDNN, just because we can specialize enough, in a fully automatic way, to the specific NVIDIA GPU that you have. And then we generate raw CUDA code that you just compile out. So essentially you run your model through TVM, which generates a ton of CUDA code, and then you compile that into a deployable binary on that specific NVIDIA GPU.
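The "bunch of little experiments" and cost models Luis mentions correspond, in current TVM, to the auto-scheduler's tuning loop. Continuing the compile sketch above (same mod, params, and target, all hypothetical), it might look roughly like this:

```python
from tvm import auto_scheduler

# Extract the tunable operators from the model, then let the auto-scheduler run
# measured trials on the real device to fit a cost model of the hardware.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tuner.tune(auto_scheduler.TuningOptions(
    num_measure_trials=2000,  # the "little experiments" on real hardware
    measure_callbacks=[auto_scheduler.RecordToFile("tuning_log.json")],
))

# Rebuild the model using the best schedules found during the search.
with auto_scheduler.ApplyHistoryBest("tuning_log.json"):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)
```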
But in the process of doing that, TVM... I mean, we do not take a dogmatic view that you should only use TVM. In some cases, of course, NVIDIA's libraries or NVIDIA's compilers like TensorRT can do better. And we want to be able to use that too. So what TVM does, it does what we call ""best of all worlds"". The process of exploring how to compile your model, for parts of your model, say a set of operators, it sees TVM's version and then cuDNN and TensorRT and thinks, ""Oh, this operator is better to use cuDNN"", and you just go and put it in. Then we link the whole thing together, such that what we produce for you, it could be a franken-binary. So bits and pieces are parts of cuDNN, maybe TensorRT, or TVM-generated code, and produces a package that essentially is specialized to your model, including the choice of whether you should or should not use NVIDIA's own software stack. Okay. Did I answer your question on NVIDIA? So this how- Lukas: Yeah, totally. Luis: And by the way, this is just TVM. We should talk about the Octomizer later. The Octomizer, you want to abstract all of that away even further. Which is, you upload your model and then you can choose. You have a checkbox, all sorts of hardware. There's Intel CPUs, AMD CPUs, NVIDIA GPUs, soon, AMD GPUs, and then Raspberry Pis, and an some cases you might choose to run and use a native stack for use. You don't even have to think about that. That's really what we want to offer, like we do not have to worry about it. Apache TVM, let's just focus on the open source now, has got quite a bit of traction in both end users and hardware vendors. End users, companies like Microsoft, Amazon, Facebook and so on, have used it. Some of them using heavily today. But now hardware vendors got more and more into TVM, who are like ARM built their CPU, GPU, and NPU compiler and software stack on top of TVM. We're working with AMD to build one for AMD GPUs as well. Qualcomm has built their software stack with TVM, and we are working with them to further broaden the reach of the hardware that is supported by that. The reason I'm telling you this is that as we enable hardware like AMD GPUs to be used very effectively via TVM, I think we will start offering users meaningful choice here. They should go with the hardware that better serves them without having to necessarily choose that based on the software stack. Lukas: Can I ask a couple of specific questions? Luis: Does that makes sense or nah? Lukas: No, that makes total sense. So we do a lot of work with Qualcomm and they talk a lot about ONNX, which I think...my understanding is that's sort of a translation layer between models and places, like hardware that they could deploy on. How does that connect with TVM? Luis: Yeah. So there's no visualization I could show you, but think of it as there's a stack. So, at the lowest level, you have hardware and then you have the compiler and operating system, then you have your code generator. So that's where our libraries are, too, that's where TVM sits. And then on top of that, you have your model framework, like TensorFlow, PyTorch, Keras, MXNet and so on. ONNX as a spec is wonderful. Essentially it's a common language for you to describe models. And TVM takes as input models written as specified in ONNX, but it also takes native TensorFlow, native PyTorch, native Keras, MXNet and so on. But ONNX, if you go to the Octomizer service today, you can upload an ONNX model. And then in the guts of the Octomizer, you go and call TVM to import the model and do its magic. 
Think of ONNX as a language to describe models. Lukas: Do you think that...I feel like one of the reasons that I've heard that NVIDIA has been so hard to displace as sort of the main way people deploy most of their stuff is because the cuDNN library is so effective. Do you sort of imagine that as TVM gets more powerful, it opens things up to other hardware companies? Luis: That's right. Yeah. I think NVIDIA has been brilliant in offering... I mean, they have a really, really good software stack, and of course they have good hardware too. But the fact that they have a usable and broad, and I would say arguably one of the best, low-level machine learning systems software stacks out there gives them a huge advantage. Some other hardware could be just as good in terms of raw processing power, memory, the kind of architecture, and so on. If they don't have a good software stack, they're simply not competitive. And we definitely see TVM as offering that choice too. Again, I don't want to sound like we are going to compete with NVIDIA. That's not the point. I'm just thinking... So just think about this. Forget machine learning. Just think about operating systems. So you have Linux. Linux runs on pretty much all the hardware that you care about. You might still choose to run Windows, but at least on the same hardware, you can choose to run Windows or Linux. Think of TVM as offering a choice of what kind of operating system you'd run on your hardware, except that you don't have to choose a proprietary one. In the machine learning world, with NVIDIA there's essentially no choice there unless you're going to go and write CUDA code directly. Lukas: So I guess one of the things, and this is probably the part of the show where I ask the dumb questions that my team is going to make fun of me for, but kind of in the back of my head, I feel like I always have this mystery where, like, a new version of cuDNN comes out, and the models get way faster with just a better library. I think about what a model does, like a convolution or a matrix multiplication, and it seems so simple to me. That's kind of how it seems, because I feel like I come from a math background and I'm just like, how could — many years into making a library — how could there be a 20% speedup on a matrix multiplication? What's going on? Luis: That's a brilliant question. Yeah. Great question, Lukas. All right, we should take a whiteboard out and I'll show it to you, because then it gets even closer to my world. Let's think about computer architecture for a second. Let's say that you are an execution engine, like a processor or a core in a GPU. So, let's start with one reason: you have to grab data from somewhere in memory. It turns out that computer memory is organized in ways that, depending on where the data is in memory, which actual physical address it is, give you much better performance than others, by a huge margin. Because depending on how you lay the data out, you can actually make the most use of the wires between your memory and your processor, between your cache and your actual execution engine in the silicon itself. But figuring out where that goes becomes a combinatorial problem, because not only do you have to choose where the data structures go, but also, when you have a bunch of nested loops that implement your convolution, you have to choose, like, if you have a four-deep nested loop, in which order should you execute them? Many orders are valid.
Which order should you execute them? And then within those, you might want to traverse...like what size of blocks are you going to traverse that? All of that is highly dependent on the parameters of your convolution. I'm just picking convolution, so even just general matrix multiplication. Long story short, for any given operator, there's literally potentially billions of ways in which you can compile the same bit-by-bit equivalent program in terms of outputs. But one of them is going to be potentially a thousand times faster than the slowest one. So picking the right one is hard. Often, this is done today by human intuition and some amount of automatic tuning called auto tuning. What's happening in cuDNN as your model gets faster is that...NVIDIA can afford a large number of programmers, so a lot of really talented software engineers, they observe where the models are going. There's some models that matters to them. They're going to go look at the model, see the parameters of all of the operators, how they're stitched together. Then they're going to start tuning the libraries to make sure that they do better data layouts. They make better loop ordering. They do better tiling of how the data structure works. They choose the direction which they're traversing, data structures, and so on. And that's just one type, that's just one operator. But now models, operators talk to other operators. So that's why there's something called operator fusion. If you fuse two operators, for example like a matrix multiplication, the convolution, to a single operator, now you can generate code in a way that it can keep data as close to your processing engine as much as possible. You make much better use of your memory hierarchy and that's yet another significant performance bump. Am I giving you a general sense- Lukas: Totally, that was really helpful, yeah. So I guess you can't actually decompose the problem down into... I was sort of picturing that each step in the compute graph, you could optimize it separately, but actually you have to- Luis: No, you have to put them together. In fact, if you read TVM, there were three PhD theses. At the very least, those are the ones that I've been involved in on the core of TVM. If you read the first paper, it's been around for several years now, one of the key messages there at the highest level was the following — by doing high-level graph optimization together with code optimization, that's where a lot of the power comes from. So essentially, say, if you choose to fuse two operators in the graph, now we need to generate really good code for it. So now you're going to go and use our automatic, highly specialized code generator. They use machine learning to do the search for this new operator that fused the two with different parameters. By combining high-level graph optimizations with low-level code generation that's specialized to that, you have significant multiplicative optimization opportunities. Lukas: Interesting. Luis: Does that give you... Lukas: No, that's really helpful, yeah. Do the new TPU architectures kind of change anything about this optimization or does it change what you're doing at all? Luis: Well, it's a different hardware architecture, so you need to go and tune for it as well. But you remember that TPUs are also made of a bunch of transistor function units and floating point units and vector units, and they have wires. They have memories organized in a certain way that you want to make the most of. 
In a sense, a lot of these specialized architectures, what they do — and in fact TVM also has an open source TPU-like accelerator that's fully open source hardware, you can actually stamp it out on an FPGA, some folks have stamped it out in actual custom silicon — it gives you sort of a template of how you think about these accelerators. They also have parameters. So there are different sizes of memories and buffers, what data types you support, how many functional units you have to have the right throughput. It's all a balance of how you organize your memory, how much of your silicon you're going to devote to compute versus storage, how many wires you have, and how the interconnection network that moves data around is connected. The reason I'm telling you this is that many times the trade-off here is the following: you might make the hardware more complicated, harder to program, but immensely more efficient. But that means that now we need to rely even more on a compiler to do really good code generation and specialize how you're going to compile your code to that specific hardware target. Lukas: Right, right. Luis: Because that's a fair trade-off. Compilation you do once, and it might be complicated, but it subsumes work the hardware would otherwise have to do every time as data is flowing, so it's much better to do it ahead of time. Lukas: I'm digging deep into my computer science education, but I feel like the story with the non-deep-learning chips, hasn't it been sort of simpler, kind of like smaller instruction sets, and trying to simplify things? It seems sort of the opposite direction of adding complexity to the hardware and then relying on the compiler to deal with it. Luis: Yeah. Yeah. It's a great question. There's so much there, and I think it could be a whole other conversation too, but when the RISC versus CISC debate happens in the computer architecture class that I teach — at grad level, I actually have them have debates — the key aspect there was that by going to a simpler instruction set, you had simpler hardware, so you could actually clock it faster. You could have lots of little instructions, but you execute a lot of them in a period of time, so you can make them run faster. It turns out that even complex instruction computers today, like x86 and Intel, break instructions down automatically into teeny ones, and it still looks like a RISC computer inside. But fast forward to today, and what's going on is that there was a huge change in the trends in where performance comes from in computer architectures. As we get closer and closer to the limits of scaling of transistor technology, what happens is the following. You used to have a certain number of transistors getting ever smaller and more and more power efficient. Then there was a change: transistors are getting smaller, but not necessarily much more power efficient, which means that you can pack more transistors on the chip, but you cannot turn all of them on at the same time. You might be thinking, "Why am I telling you this?" Because that's the whole justification for going more and more specialized, and having a big chip, a bigger chip with lots of different, more specialized functional units. They're not general, but they're much more efficient, because every time you add generality in the hardware, fundamentally, you're adding more switches. For example, a general-purpose CPU can do anything, and there's a large fraction, more than half of the transistors there, just simply sitting there asking questions.
""Am I doing this or that? If I'm doing this, I do this."" And then you have to make decisions about the data that's flowing through because it's supposed to be general. So the trend that we're seeing now is that, well, we need to make this thing much more efficient, otherwise we can't afford the power to run a global infrastructure, or you can't afford the power to run machine learning. You have to squeeze efficiency from somewhere and the way you squeeze efficiency, you remove all these transistors just sitting there wondering what they should do, with transistors that do only one thing, and one thing very, very, very well. Sure, it makes it harder to program because now you have to figure out when and how you should use these specialized functional units, but immensely more efficient in terms of performance per watt and immensely faster than general purpose computers. Did that answer your question or did I make it more complicated? Did I confuse you, or did I... Lukas: No, this is incredible. I feel like I'm finally getting clear answers to questions that have been in my head for a long time, so I'm actually really enjoying this. What should I be imagining as like a specialized instruction? I hear on the M1 laptop, there's like a specialized thing to play videos... what does a specialized instruction look like? Is it like there's a convolution structure, so it could pass through? Luis: Yeah. For example, it's an eight-by-eight matrix multiplier, single instruction. Lukas: Really? Luis: Yeah. You can evoke that. You set up, you put all the data in the right place and you say eight-by-eight matrix multiply. Boom, it happens. Lukas: In one tick? Luis: Not exactly in one tick. It's like, it's one instruction, which means that you're giving one command. It could be broken down into multiple cycles depending on how it's scheduled. But from your primary point of view, there is hardware there that's essentially in the arrangement of your transistors that implements your functional units, and your memory's organized in a such a way...there's something called a systolic array, I don't know if you've heard this term before. Lukas: No. Luis: Systolic array, it's an array of multiply and accumulate. So think of it that way. You can just flow data in a specific way that, if you just arrange it just right, then you flow between, in one flow you've done an eight-by-eight RTMM (?). But to do that, you have to arrange all the data in the right place and then click go. Not click, issue instruction ""Go"". But now to answer your video compression question or video codec. We call it an instruction, but more likely, it's essentially a piece of hardware that's just sitting there, knows where to read data from, and what you do is just configure it. You're not giving... the program for real is actually in the actual function-specific hardware. And all you do in your code is to say, ""Activate that now. Here's the data stream, activate that"". Then you have a fixed function hardware that just starts crunching through that and decoding your video, for example, or applying a certain computation. Another thing that people are doing in hardware is activation functions. Some activation functions are so popular. People use it all the time, but why are you going to break it down into 30, 40 instructions when you can design a piece of hardware that does that and just that. And all you're doing is when you call that activation function, you just activate that piece of hardware. Lukas: Wow. 
Lukas: Wow. So I guess if it's sort of laws of physics that are pushing this trend, it seems like you'd probably expect this trend to continue for a long time, right? Luis: Oh yeah. Lukas: And if it does, where would it go? Would there be even more and more complicated structures possible in the hardware, and wouldn't that sort of make research harder? What if you wanted to do a new activation function that wasn't available? Luis: Yeah. So that's a really great question, Lukas. Let me try and answer the first big question first, and then we can branch down to these other sub-questions about research and how we continue advancing this. So, yeah, that's the reality. Right now we already have quite a bit of diversity, not just different hardware chips and hardware parts. Just look at all the AI chip companies out there, just look at what's happening to the general-purpose processors, like Intel processors getting specialized instructions that are relevant to machine learning, and so on. That's going to continue, because, honestly, there's just no other way to get efficiency. Unless, now let me open a nerdy speculation, unless we can teach atoms to arrange themselves at the atomic level, like, "Let's reconfigure where your wires are," and therefore you have your chip doing a new thing. Lukas: There's a kind of chip like that, right? Like an FPGA. Is that it? Luis: Yeah, but I'm going to get there. But there's no magic. An FPGA is just a bunch of wires that are there, and you're just inserting data to tell it which wires it should use. But the wires are always there. And just the fact that you have a table that tells you, "If I have this bit on, I'm going to use this wire; if I have that bit on, I'm going to use the other wire," just that causes inefficiency. So it's always a trade-off. Think of it as a trade-off between how general or how specialized your hardware is; there's a generality-versus-specialization curve. More general: less energy efficient, easier to program. More specialized: more efficient, harder to program, and so on. But then you have FPGAs. How about FPGAs? FPGAs, essentially, are a very general fabric with a very complicated programming model. Because what FPGAs are is a bag of wires and little routing tables, with some multiply-and-accumulate units, and more and more activation functions and other popular compute elements, sprinkled into an even fabric. And then you just set bits to figure out how you're going to route the data. So the way you program that looks like how you design hardware, and they can be very efficient if you do it right. But fundamentally they're not going to be more efficient than true fixed-function chips. You're never going to see an FPGA competing with a GPU on the very same task. You see FPGAs competing with things like GPUs and so on when you can specialize to your application, and even with the efficiency hit of the hardware, you still have a win. Does that make sense? Lukas: Totally. Luis: So for example, let's say you decide that you want a two-bit data flow for... let's say quantization to two bits in one layer, three bits in another layer, and, I don't know, one bit in another layer. It just so happens there's no existing CPU or GPU silicon that can do that for you. Chances are you're going to be living with an eight-bit datapath, and you're going to ignore some bits there, and then you're going to waste efficiency there, or you're going to do inefficient packing.
But with an FPGA, you can organize it such that you only activate...you only route your circuits to use the two bits or one bit or three bits. In that case, because the data type is more unique, you can specialize to your model, then you can do really well with an FPGA. Lukas: That makes sense. Luis: And on research, to answer your question on research. Research, I think, is getting more interesting, honestly. Maybe I'm getting old and a curmudgeon here, but I feel like — I want to say curmudgeon, I mean I'm being old and optimistic here — is that I've never seen computer systems architecture and systems optimization being as interesting as it is right now. There was a period of researching this, it was just about making microprocessors faster, making a little bit better compilers. But now that we have to specialize and there's this really exciting application space with machine learning that offers so many opportunities for optimizations, and you have things like FPGAs, and it's getting easier to design chips, we create all sorts of opportunities for academic research and also for industry innovation. Hence, we see all these wonderful new chips, Xilinx with new FPGAs, and new FPGA companies, and some are novel reconfigurable fabrics, and all of these cool hardware targets. Lukas: I guess I'm curious, it seems like ML is becoming a bigger and bigger fraction of data centers, and data centers are becoming a bigger and bigger fraction of global energy use. Do you feel like there's an environmental impact that you can have by making these things run more efficiently? Luis: Absolutely, yeah. And we're not the only ones to make that claim. Essentially, every time you make an algorithm faster in the same hardware, you're saving energy, you're saving trees. You're reducing resource pressure. Performance optimization is this wonderful thing that you can reap the benefits in so many ways. If you make it faster, you're gonna make your users happy. But also even if it's not latency sensitive, you're going to make your finance folks happier because they're gonna spend less on cloud bills. But in the end you're going to be using less energy. And that really matters. Now, what's interesting about environmental impact specifically is that, as you pointed out, there's a growing fraction of energy in the world that's devoted to computing. I'm not going to get into cryptocurrencies. We're not going to go there right now. That's a whole separate topic, thinking about the energy costs of that. Let's just think about the energy costs of machine learning infrastructure, that includes training and deploying models at scale. It's fair to say that in a typical application that uses machine learning today, the majority of the cycles will go to the machine learning computation, memory that you have to keep alive with energy. Anything that you can do to make the hardware more efficient, to make your model more efficient at the model layer, or making it via compiling and optimizing the model specific hardware, is a win, both in terms of user experience and energy efficiencies. By making it more energy efficient, you make it much less environmentally impactful. So, absolutely. You should take every opportunity you can to reduce the energy that your models use, especially if it's applied at scale. Even if it doesn't matter from a user experience point of view, we should do it because that's just the right thing to do. 
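As a back-of-the-envelope illustration of Luis's point that performance optimization at scale is also an energy win, here is a tiny worked calculation; every number below is purely hypothetical.

```python
# Hypothetical service: 1B inferences/day, 10 ms of accelerator time each,
# a 25% throughput improvement from model/compiler optimization, and a ~300 W
# accelerator. All figures are made up for illustration only.
requests_per_day = 1_000_000_000
gpu_seconds_per_request = 0.010
speedup = 1.25
gpu_power_watts = 300

baseline_gpu_hours = requests_per_day * gpu_seconds_per_request / 3600
optimized_gpu_hours = baseline_gpu_hours / speedup
saved_kwh_per_day = (baseline_gpu_hours - optimized_gpu_hours) * gpu_power_watts / 1000

print(f"{baseline_gpu_hours:,.0f} -> {optimized_gpu_hours:,.0f} GPU-hours/day, "
      f"saving roughly {saved_kwh_per_day:,.0f} kWh/day")
```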
Lukas: Can you really separate the model compiling and performance, and the way that the model is designed? It feels like a lot of the performance improvements in models come from sort of relaxing the constraint that you need to exactly do the convolution or the matrix. I mean, just for example, quantization, where you do it in a ludicrously small level of precision, it seems to work really well. Luis: No, absolutely, no. And I did not mean to imply that we should only do a model compilation. Remember that I said, I'm assuming that you're going to come with your model tuned for the least amount of computation you can possibly use. That's the ideal case, but you're absolutely right that there are optimizations at the model level that actually changes the statistical representation of the model that enables new optimizations. And we can do that too, but TVM does have growing support for quantization. But what I'm particularly interested in, in general, is how do you put things like TVM in the whole network architecture search loop? As you make decisions about your model architecture, and as you retrain for different model architectures, and you can make new optimization decisions on the model layer, and change the convolution, the data types and doing all sorts of things like pruning and compression, deep compression, et cetera. Now, put a compiler in the loop, like TVM, and measure what the performance that you're getting as part of your search loop, because then you really get the synergies. You're right that you cannot completely... you can decouple them in principle, and you're still going to do relatively well. But if you do both of them together, I think you're up for more than the addition of either of them in terms of potential opportunities. That's what TVM did in terms of high-level graph and low-level optimization. By doing them together, we show that we can do better. And I do think that the same thing... I have data points to show that the same thing could happen if you do model building and tuning decisions together with a low...model compilation and hardware tuning together. Lukas: Are there trade-offs between... Like with GCC, you can optimize for memory or you can optimize for speed. Is there a latency-memory size trade off here? Or are they both sort of like aligned with each other? Luis: Yeah. So that's a great question. Of course, one optimization that definitely impacts memory usage specifically is when you do model compression or if you do quantization. So if you go from FP32 to int8, you already have a 4x footprint reduction in your... You go from 32 bits to 8 bits- Lukas: But that'll also make it run faster, right? So there's no real trade-off there if the quantization keeps the performance UI, right? Luis: Potentially. If you're assuming quantization that's just like, you have the same model architecture and you just change the data type and go. But that's sort of like the easy, lazy quantization. The right way of doing it, in my opinion, is that once you change the datatype, you're given an opportunity to actually go and retrain it and some parts of your model become less... I think the right way of doing quantization is not just ""Quantize your data type and forget about it"". It's actually ""Close the loop and put it on a network architecture search"", such that as you change the data type, you actually allow for different types of... 
and then in that case, I think you're up for significant changes to the model that would make quantization potentially even more effective. But I did not answer your question. So what's the trade-off between latency and footprint? Well, it could be that. It could be that you actually quantize your model, but then you make it deeper to make up for some accuracy loss, which might make your model potentially slower, but use a lot less memory. So there is that trade-off there too. Lukas: I guess my experience of deploying models, and I'm just an amateur at this, but I love my Raspberry Pis and other cheap hardware. Luis: And we support Raspberry Pis pretty well in TVM, you should try it out. Lukas: I will definitely try it after this. So I did it kind of in the early days of trying to get TensorFlow to run, when even that was a challenge. And I felt like basically, with models, it was sort of binary: either I could fit it in the Pi's memory and it would run, or I couldn't fit it in the Pi's memory and it wouldn't run. So it seemed like less about sort of optimizing, and just, either I'm sort of stuck or I'm not. Is that a common situation, or? Luis: It's hard to say if it's common. Often, at least for the models that we get, by the time we pay attention to them we know that they run now, but they typically don't run at, say, the frame rates that you want to get. They run at half a frame per second, and you want to show a path to 20 frames per second. By that time the model already fits; you're optimizing for performance. But often this performance optimization also comes with model size reduction; quantization is another one. Let's say you can just go from FP16 to int8 and it works well, boom, you do that. You probably improve performance and you also reduce model size. But I've seen plenty of cases where the model already runs, and what's hard is to actually get to the target latency that would actually enable the model to be useful. That's actually, by and large, what we tend to see: you get your model to run, you hack it enough to get there, but then it's never fast enough. And then you've got, you know, another 10x ahead of you for it to actually be useful. Lukas: Totally. I don't want to not ask you about your company, OctoML. I feel like you're one in a growing line of people that I'm talking to that are professors and are starting companies. I mean, what inspired you to build this company? Luis: Yeah. Great question. So first of all, it's one of those moments where all the stars are aligned. TVM had gotten quite... We started the company just about a little under two years ago. TVM had quite a bit of adoption by then already, and we saw more and more hardware vendors starting to choose TVM as their chosen software stack. We ran our second conference here in Seattle, and I saw a room full of people, and I thought, there's an opportunity here to make what TVM can do more broadly accessible. And I said the stars were aligning because I was looking to start another company, and I had become full professor a couple of years before then. A lot of the core PhD students in TVM were graduating. One of our big champions of TVM, Jason Knight, who was at Intel at that time, was one of our co-founders, and was also looking to start something, and all the stars aligned. I feel extremely lucky that we had that group of people ready to start a company. And we work really well together. There's a lot of synergy there.
But that's sort of like ""the stars aligned"" part. Now in terms of technology, it became really clear to all of us that, look, you have this cross-product between model and hardware, and there's such a huge opportunity to create a clean abstraction there, and at the same time automate away what's becoming harder and harder about making machine learning truly useful and deployable. Honestly, in MLOps — and I don't love that term, because it means so many things — but going from data to a deployed model, it's clear that the tools to create models got good pretty fast. There are a lot of people that can create models today, and good models, and a large repository of models to start from. But after interviewing a bunch of potential customers, we realized that, hey, you know, well, people actually have a lot of difficulties in getting models to be deployed, precisely because of the software engineering required and the level of performance requirements and cost requirements to make it viable. So we formed OctoML to essentially make TVM even more accessible, or technologies like TVM even more accessible, to a broad set of model builders, and also make it part of the flow. Let me just tell you briefly what the Octomizer is. So the Octomizer is a machine learning acceleration platform. It has TVM at its heart. You have a really clean API, just a couple of calls: ""Upload model, choose a hardware target, then download the optimized model."" You upload the model, then you can choose the hardware targets that you want. The Octomizer calls TVM, or it can also use ONNX Runtime, and we're going to keep adding more...again, we want to offer users the abstraction that you upload the model and you get the fastest possible model ready to be deployed on your hardware in a fully automated fashion. You either get a Python (?), ready to download, or we're working on gRPC packing, so we can deploy in the cloud or cloud functions and so on. So the value add here is all this automation that we provide on top of TVM, and also the fact that, as I mentioned, TVM uses machine learning for machine learning, and we have a data set for a lot of the core hardware targets that the world cares about just ready to go. So you don't have to go and collect it yourself. Lukas: I would think running OctoML, you would have real visibility into how the different hardware platforms compare with each other. I'm sure you don't want to offend the hardware partners, but do you sort of have first-pass recommendations for what people should be targeting in different situations? Luis: Yeah, and that's one of the things where I want the numbers to speak for themselves. So what you can do is, if you come to the Octomizer (in fact, we are open for early access and we actually have some real users already using it regularly), you upload a model, then you can choose all sorts of hardware targets, and then what you're going to get, you're going to get a dashboard saying, ""Here's your model, here's the latency on each one of these hardware targets"", and we can compare TVM with other runtimes, like ONNX Runtime, for example, and we're going to show you which one you should use, and you can choose based on that. Of course, we are working hard to improve the interface to enable users to make decisions about costs too, for example. You might want to get the highest throughput per dollar, for example.
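For readers who want to see roughly what the open-source layer underneath this looks like, here is a minimal sketch of compiling an ONNX model with TVM's Relay API. This is not the Octomizer API (which isn't shown in the conversation); the file name, input name, and shape are assumptions, and exact module paths differ a bit across TVM versions.

import onnx
import tvm
from tvm import relay

# Load a model exported to ONNX elsewhere (path, input name, and shape are assumptions).
onnx_model = onnx.load('model.onnx')
mod, params = relay.frontend.from_onnx(onnx_model, shape={'input': (1, 3, 224, 224)})

# Pick a hardware target; 'llvm' is a generic CPU, while targets like 'cuda'
# or an ARM-specific llvm string would cover GPUs or a Raspberry Pi.
target = 'llvm'

# Compile the graph-level and operator-level program for that target.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Export a deployable shared library containing the compiled model.
lib.export_library('compiled_model.so')

The auto-tuning Luis describes, using machine learning to pick good schedules for each operator on each hardware target, sits on top of this basic compile step.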
I would say that it's fair to say that models vary so much that it's hard to say upfront which one is going to be the best. What you should do is run it through the Octomizer, get the most efficient version and binary of your model out, and then measure that. Lukas: Cool. Well, I guess that kind of actually leads me into the two questions that we always end with, which I want to give you kind of time to chew on. And I haven't asked you about a lot of your research. It seems super fascinating, but I guess I wanted to ask you, what do you think is a topic in machine learning that doesn't get enough attention? That if you had extra time to just work on something that you're interested in, maybe you would pick to go deeper on. Luis: Yeah. So I would say it's getting more and more attention now, but I've always been interested in, and a lot of my research has been in, automating systems design with machine learning and for machine learning. TVM is one example of using machine learning to enable better model optimization and compilation, but also doing hardware design and programming FPGAs, for example, is really hard, and machine learning could have a huge place there. So let's say designing...what I want is really ""model in and automatic hardware plus software out"", ready to be deployed. I think that's one that I'm passionate about, and I think you can have quite a bit of impact precisely because you can reap the benefits in so many ways. You get new experiences because you enable new applications, but also make it more energy efficient. So I think we should actually always look at what is the energy cost of deploying this at scale, if it's going to be deployed at scale. Because in rich countries, you don't think about it. You just go pay the energy, even if it's high. But now if you really actually think about the environmental impact of running these at scale, it's something that one should pay attention to. Lukas: So this is actually using machine learning to optimize the model? Luis: Using machine learning to optimize, not just the model, but also the system that runs your model, such that you get better behavior out. They can be faster, higher throughput per dollar, but also much lower energy use. And I think it's definitely incredibly exciting and possible to do. So that's one of them. Now, let's see, one that doesn't get as much attention, but now it's getting more attention, that's dear to my heart, I will now touch into, is the role of machine learning in molecular biology. Lukas: Oh, right. Me too. I totally agree. Luis: So as part of my research personality, for the past six years or so, I've been heavily involved in an effort to design systems for using DNA molecules for data storage and for simple forms of computation. Some of it is actually related to machine learning. For example, we recently demonstrated the ability of doing similarity search directly as a chemical reaction. And what's cool about that is that, not only is it cool, we're definitely pushing a new device technology alternative that's very viable and has been time-tested by nature- Lukas: Time-tested for sure. Luis: Yeah, it can be extremely energy efficient and fundamentally, the design of molecular systems is so complex that I cannot imagine any other way to design them than using machine learning to actually design those molecules. And we do it all the time. Like we had a paper that you might find cool late last year, it was in Nature Communications called ""Porcupine"".
And we used machine learning to design DNA molecules in such a way that they look so different to a DNA sequencer that they're not going to be confused with natural DNA. You can use this to tag things. We designed these molecules, you can go and tag ""arts"" or tag ""clothes"" and so on. Basically you take a quick sample, you run it through a sequencer and you can authenticate that based on these molecular traces. But that was made possible because of machine learning in designing the molecule and actually interpreting the signal out of the DNA sequencer and so on. I feel this space...it's not fair to say it's not getting enough attention, I think it's getting more and more now precisely because of the pandemic and all of the other reasons why molecular biology matters. But I find it incredibly exciting, and a lot of the high-level motivation for things that I do, both in research and in industry, is enabling use cases like that: things that require so much computation that they wouldn't be possible without a very efficient, very fast system. Lukas: Cool. I guess the question we always end with, which you've touched on a lot in this conversation is really, what do you see as the big challenges today of getting machine learning working in the real world? Maybe when you talk to your customers and they sort of optimize their models, what are the other challenges that they run into when they're trying to get their optimized model just deployed and working for some end use case? Luis: I devote a good chunk of my life to deployment, to automating the engineering involved in deployment, but I don't want to sound too self-serving by saying that's the biggest problem. I think that's a huge problem. It's a huge impediment in terms of skill set because it requires people that know about software engineering, about low-level system software, and know about machine learning. So that's super hard. That's one, definitely getting the model ready for deployment. But then there are other ones, which is just making sure that your model is behaving the way it's expected, post-deployment. Like observability, making sure that there aren't unexpected inputs that make your model misbehave, having fail-safe behavior, and so on. I think that's one that is no news probably to this community, that some applications require...either because it's the right thing to do when a model is making decisions that are super important, you want to understand how they're done, and making sure that they actually hold up on unexpected inputs. So I think that's one of the harder ones, because like any engineer that's thinking about the whole system, you have to think about the weakest link in the system. And I worry that if you don't do something proactively, the weakest link in these systems is going to start being the models, which you can't really reason about in a principled way. Lukas: Yeah. Awesome. Well, thanks for your time. This was a lot of fun. Luis: Of course. Thank you, Lukas, this is awesome. Yeah, I enjoyed it immensely. Thank you. Lukas: If you're enjoying Gradient Dissent, I'd really love for you to check out Fully Connected, which is an inclusive machine learning community that we're building to let everyone know about all the stuff going on in ML and all the new research coming out.
If you go to wandb.ai/fc, you can see all the different stuff that we do, including Gradient Dissent, but also salons where we talk about new research and folks share insights, AMAs where you can directly connect with members of our community, and a Slack channel where you can get answers to everything from very basic questions about ML to bug reports on Weights & Biases to how to hire an ML team. We're looking forward to meeting you.",9604 +Matthew Davis — Bringing Genetic Insights to Everyone,https://www.youtube.com/watch?v=A0_b7pwKzmM,2582,2021-06-17,"Matthew: There's lots of genetic defect in everyone. We think the average healthy person has a couple of hundred genes with defective function in their genome. And there's probably lots of what we think of as subclinical symptoms wandering around, across the whole population. So I think that's the future vision of the company: everyone would benefit from having a full understanding of their genetic background. And it's a complicated problem that the doctors don't understand, the patients certainly don't understand, and that we're really at the frontier of understanding. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. Matthew Davis is the Head of AI at Invitae, a medical genetic testing company. He applies a really wide range of machine learning techniques to the genetic testing problem, which I think is one of the most interesting applications of ML today. I'm super excited to talk to him. Invitae is actually a household name in my house because my wife runs a startup that sells to Invitae and I run a startup that also sells to Invitae. So we're one of the very few overlapping customers. So I feel like I know Invitae very well, but I was thinking if that wasn't the case, I would definitely not know Invitae. So I was wondering if you could describe what Invitae does and how I might interact with your products as a consumer. Matthew: Yeah, sure. So for starters, we're a medical genetic diagnostics company and pretty sure by volume of tests, we're the biggest in the world. We're- Lukas: Which is amazing, because you're fairly new for a public company. Didn't you start in 2010 or something like that? Matthew: Yeah, that's about right. So the company itself is about a decade old and that's not a coincidence because the availability of high-throughput, low-cost genomic sequencing really came online in 2008 with Illumina making it a scalable platform. And at that point it became clear that instead of analyzing one or two genes at a time, you could be analyzing lots of genes for less money. And I think the strategy that was clear to the founders was, it's a very narrow market with a very high margin that actually should be an addressable market of everyone with access to modern medicine. And instead the cost could be low and the volume could be high. And that if you pursued that strategy, there were actually way bigger benefits to mankind and also shareholders of a company, because you'd start to learn things about medical genetics, and disease, and probably most importantly relationships to treatments that you weren't going to learn if you took a small, addressable market strategy. So that was the vision- Lukas: I think you live in this, but for someone who hasn't had a genetic test for a medical reason, what would be a scenario where you'd actually want that, and what would it do for you? Matthew: Yeah. So classically, diagnostics were not about genetics.
They were about your cholesterol is high or some other hormone is low or whatever. Genetic diagnostics were first proven en masse at breast cancer where we know there are genetic predispositions that would change your treatment strategy. Where the risk is high enough, that if your mother or your sister or your grandmother, your aunt had breast cancer at age 70... People get cancer when they're old. But if it's at age 35, that's way scarier. And when we added genetic analysis on top of that, we could further partition that to, ""Oh, because they had this variant, they were early cancer patients."" And then your doctor can help you make the best practice decisions about how to avoid it. At the extreme level that's a prophylactic mastectomy, but there are lots of other intermediates including like which type of drug would be most effective for you. Should you risk the downsides of chemo? Should you just wait? So it worked with breast cancer. And then as we started being able to analyze more and more genes, we started discovering more and more things that it works for. I think one thing to keep in mind is that human geneticists, in general, they historically study horrific things. Like when you're studying fruit flies or mice or humans, it's some big effects. There's this old saying, big mutations have big effects. Meaning we study things that could give you a heart attack at an early age or make you grow tumors or have crippling nervous system diseases. And there are lots of people at risk for that who don't know it. But I think the real future is there's lots of genetic defect in everyone. We think the average healthy person has a couple of hundred genes with defective function in their genome. And there's probably lots of what we think of as subclinical symptoms wandering around, across the whole population. So I think that's the future vision of the company is everyone would benefit from having a full understanding of their genetic background. And it's a complicated problem that the doctors don't understand, the patients certainly don't understand. And that we're really at the frontier of understanding. So that's a little bit of the history and a little bit of the future mission statement. Lukas: And so I'd imagine that a lot of people listening to this have done one of the consumer tests of maybe Ancestry or 23andMe. So how is what you do different from what happens there? Matthew: Yeah. I mean, it's a great question. And it's one that pre-COVID, riding on a plane, someone asks you what you do. And they're ""Oh, like 23andMe"". And you're ""Ah, man, no"". I mean, the obvious difference to the interaction with our customers is that historically it goes through a doctor. It's a medical test and you want that provisioned and administered by a medical professional. And 23andMe, it's a fascinating company, but they've focused on things whether or not you like cilantro, not whether or not you're at risk for disease. And they have tried to move into a diagnostic space, but they're not built for that. And we finally acknowledged that a couple of years ago, after many years of not wanting to offer a medical diagnostic procedure directly to patients, because we didn't want people to go with information in hand, but not understanding an explanation that they could get from a medical caregiver. So that's the real differences, like we have medical caretakers in place, but we now have a strategy where we let patients initiate their orders. 
And that's really because there's a lot of the country that doesn't have access to one of the few thousand genetic counselors in the US, so there are places where it's a six-month wait. There's places where you're just not going to go. And thanks to telemedicine, thanks to software engineering, it's easier now for us to let a patient whose mother had breast cancer start the process themselves. We refer them to a telemedicine genetic counselor who helps them by being their medical caregiver, without them having to wait six months to go to a medical center that's a hundred miles away. And we think that's great. Does it lead to a little more confusion? Well, we used to say we're not consumer-facing, and now we have patient-facing order forms, but that's really the difference. We are trying to help people with a complicated medical problem, not find out their ancestry, unless of course your ancestry has direct bearing on the medical risk. Lukas: Got it. That makes sense. I mean, how does AI fit into this? You talk about a broad range... When I've talked to you in the past, I've been shocked by the number of different ML fields that you draw from. So I was thinking maybe you could give me an overview of the different problems where machine learning techniques can help with what you all are doing. Matthew: Yeah. So we say AI, and it was an easy-to-adopt term, especially for the last few years, but when I think of AI, I really think of every chapter of a textbook for a field of computer science that's been around for many decades. And a lot of it, thankfully, is machine learning and some of it's not. So some of it is optimization algorithms and robotic planning and so forth. It has been around for a long time and it's still making rapid advances in those fields, but maybe they're a little less well known to machine learning folks. And then a bunch of it is machine learning approximations that can make a problem tractable that wasn't tractable before. So I mean, the actual applications. We have a key scaling problem. In a lot of ways we're a manufacturing company and our volume tends to almost double every year. And we have a laboratory that has to run assays with actual robots. We have rather complicated standard operating procedures and business process models that need careful execution, not to mention audit logging and accounting stuff. It's the medical field, so we have to be compliant and follow not just HIPAA laws, which are complex, but also contractual obligations to insurance companies and things like that. There's a lot of complicated process modeling. And then there's a lot of knowledge worker problems. So we have on staff dozens of PhD geneticists and biologists who have done this Herculean task of curating the medical genetic literature for any scrap of evidence that could inform whether this variant that seems to be breaking the function of a gene in a patient could actually be the causal factor that puts them at higher risk, because it's still unknown. Most of the variants that are analyzed in medical genetics, we're still uncertain what their eventual effect would be. That involves literature mining, all the most contemporary NLP methods for entity extraction, relationship modeling, linking ontologies. We don't get into things like summarization because even the fanciest, most expensive model isn't confident enough to write a medical report for someone.
But the sort of language modeling that goes into something like GPT-3, we can use that for concept embeddings, for extraction, for classification, for recommendation engines. So we have a lot of that NLP work that a lot of the rest of the world thinks of, and then we've got a fair chunk of computer vision problems, whether they're things like document processing or they're computational biology problems. And about half of my team is devoted to more core research advancing future products, doing academic collaborations with folks. So they're really trying to struggle with the problem I stated before, which is geneticists traditionally focus on big diseases with big mutations, but there's a lot more subtle signal going on for almost everyone on the planet. And, in a sense, it's a signal detection problem. It's really a high order of complexity. It's silly to think of it this way, but if you just imagine we have 25,000 genes working in combinations or anything, how do you search a space of 25,000 factorial combinations? So the hope is that things that were completely intractable before by enumeration could be tractable by approximation. And so that's one of the great hopes for computational biology, that we can prune the search space with machine learning. We covered computational biology, knowledge, and operations. That's a big breadth of stuff to worry about. And then on top of that, I think there are things like graph embeddings for heterogeneous networks, where there's lots of reasons to believe that heterogeneous entities out in the literature shouldn't be just treated as word tokens that you learn with a language model. But instead you can layer on causality and known relationships. Biology is this kind of fascinating field because if you really cared about Newtonian mechanics, then you probably don't need a neural network approximator to tell you how fast the ball is going to roll down the inclined plane with a certain coefficient of friction and whatever, right, because you can physically model it really accurately. And in biology, if you open a biology textbook, there are all these cartoons of ""This protein binds to this protein and they both bind to the DNA, and then the RNA is made"" and whatever. And they're not just cartoons that you've memorized when you're a biology undergrad, they're actual physical models of a material process of the universe. But the uncertainty is way higher. They are rough drafts, and because they're tiny little submicroscopic machines, historically we don't just take the picture. I guess that's less and less true, because electron microscopy is now getting really good, and x-ray crystallography in some ways is really good at that. But for the most part, you do it by inference. You do some experiment and the readout is like, you look at different colored bands in, like, a jelly of agarose in a tray, and it's all by inference. So when you look and see one of those CSI TV shows, and they're looking at the big bands of DNA, that's a very abstract version of the actual physical process. And that's where it's great for machine learning because there's enough structure to that cartoon that you don't have to imagine every possible force vector. You have some constraints, but it's uncertain enough that it's not Newtonian mechanics. So modeling it with uncertainty and then using those indirect observations to guide your search, in a lot of ways, it's a perfect field for using model-based machine learning. Lukas: Well, okay.
So right, I'm taking like a mental note of all the different applications that you mentioned. I have so many questions on each one, but maybe we should start with the last one, because it seems very intriguing. Why would a company like yours care about modeling the chemistry of molecules? What does that do for you? Matthew: Yeah. So, I mean, we know that if you put a change in this DNA sequence, there's a high likelihood that it's going to change what amino acid is put in the protein, very predictably. We can predict that from basic biology knowledge, but we don't necessarily know that's going to affect the function of the protein. And the easier ways historically to make computational estimates of that were ""Compare the sequence of that gene across a thousand related species or 10,000 humans and see, is it always the same letter?"" And it's probably important, because if it's not important then evolution will let it float around, but there's actually quite a lot of flexibility in those proteins where they're still functional. So there might be a subset of people where it's different and it doesn't actually matter. If you were an actual biochemist, then you might go do experiments in the laboratory, seeing how the proteins actually touch each other and discovering that the enzyme works better or it doesn't work as well. And it's really expensive and time-consuming to do that; it's a slow process and hard to do at scale. But if you had molecular models of those physical properties, then you could do in silico experiments and say, well, I can't be sure that the enzyme's not going to be as efficient, but based on a whole lot of- Lukas: You say in silico, you mean like, in silicon? I don't know, that's a new term to me but I love it. Matthew: Yeah. I mean, that's not me, that's biology. Biology loves Latin and so that's been a well-tested phrase in computational biology for a long time, but yeah, that's the right answer. That's what it means. So you're doing the simulation and then you can say, ""Well, with some certainty, based on the parameterization of those actual biochemical experiments that other people have done, this looks like a big change and therefore it's going to affect the function of the gene"". And therefore we have more reason to believe, in a very Bayesian sense, our belief increases that this is the cause of someone's disease. Lukas: And is this something you'd really do in the future? Or is it in use now? Is this something that everyone would have to do to make a realistic model of... I guess how in use is this kind of modeling, for deciding what genes to look at? This is something you do every day? Matthew: I mean, this is definitely a thing that our company does. There's a team that does this, and I think it's also an interesting example of a case where industrial research has more potential than traditional academic research, just because of the volume. The biggest academic collaborations for genome sequencing don't actually get to the same number of people as come through our samples. And they're not as enriched for people who actually have disease. Like the big population genome sequencing centers in China, and the UK, and the US, they're not generally systematically going after people with disease. We have an ascertainment bias. It's actually a benefit if we want to study disease, because people with disease in their families come through the door.
And that means that we can do stuff that you can't do if you're working at the Broad Institute at MIT and Harvard or in Cambridge with the European Bioinformatics Institute; we have access to data that you can use these methods on that no one else can. Lukas: And how do you actually set up this problem? How do you formulate it? My mind is just, how would I set this up as a machine learning problem that I could actually train on? Is it standard, like what the loss function is? What can you actually observe to put into this? Matthew: Yeah, right, I mean, I don't think there's a canonically true answer to that question, but we can talk a little bit about the pros and cons of the approaches. So, I mean, one thing is it's not a consumer recommender system where you recommend products and people click on them and buy them, or they don't. In fact, diagnostics in general has this problem of no ground truth. People die of symptoms on hospital beds and their doctors don't actually know in some sort of Plato, Aristotle sort of way, why they died. It's just that we have a stronger belief about the causality. So you can take a labeled dataset and say these people were diagnosed and they had that variant, and I'll make a model that can predict that outcome with supervised learning. But you're not actually dealing with ground truth, because some of those people had the disease, they had the variant and they had the disease, but they actually had the disease because they smoked cigarettes for 50 years or because they were 90 years old or some other confounding factor was there. So if you want to try and think of it as belief, then you can go down the Bayesian probabilistic graphical model, causality, Judea Pearl, explainable AI path that I think people are excited about talking about, but you have to know that a lot of human knowledge goes into that. And it's not as simple as I have some labeled data and I'm going to train an arbitrarily deep neural network to approximate the softmax or something. So you end up working a lot with ""How do I take those physical models of what I think is going on in biology?"", and you're trying to design the algorithm to do that, whether it's with causal graphical models or it's just knowing, ""From these feature vectors I can learn an autoencoded representation that should in theory account for these factors we know from the physical model are important"". And then I'm going to let the neural network set the weights by showing it the observations that are the closest to ground truth people will ever have. Lukas: But it sounded like there are sort of subproblems here that people work on. Like you talked about people looking at proteins and mixing them together and seeing what happens. Is that a subproblem of this bigger problem where you could have different observations and build a different model around it? Matthew: Yeah. Right. So, I mean, we do have a wet lab team that is collecting basic molecular biology data. But one of the awesome things about biology is that lots of other people are doing that. So lots of professors and lots of universities and lots of their grad students are collecting and publishing data in a way that is ingestible for us to learn some of those things. But there are places where you may identify a key deficiency in that knowledge, where it's like, well, it's worth it for us to do this experiment because it would really help us parameterize what we think is missing in this model.
And so then, from an industrial research perspective, you have to think about the cost benefit. Is it worth spinning up a wet lab initiative to do stuff that's hard to do at scale? If you're feeling that you want that feature vector, it better be worth it, because it's not cheap. Lukas: Sure, sure. Although it sounds like it's evocatively similar to exploring a space of hyperparameters for you know... Matthew: I mean, maybe it's more like if you had a product recommender and you knew everything that everyone had ever clicked on, but it just still doesn't seem to have that much accuracy. So you send out some design researchers to talk to your customers and sit in their house with them and talk to them like, ""Oh, that's weird. Everyone's buying Adidas. I didn't notice that before. Is that in the model? Let's go find out all the shoes everyone buys and then see if that improves the accuracy."" Lukas: Got it. That makes sense. Matthew: The thing is, if you had the whole life history of me as an individual and everything I'd ever done, then you might be able to start down the path of that modeling, but that's crazy. No one would do that. You just look at my ad click data and make some recommender that, if it has 38% accuracy, is going to make a bunch of money for an ad company. But if you're talking about someone's health and complex things like biology, then you want it to be higher accuracy and you've got to go actually model stuff out deeper. Lukas: So I guess another whole field that you talked about doing is sort of the, what do you call it, sort of medical NLP or bioinformatics. I mean, this is one thing I've been curious about, you've seen a lot of progress, very visible progress in NLP, notably GPT-3, but also these word embeddings becoming super popular. Has that influenced bioinformatics? Does that directly apply? Can you fine-tune these models on medical text domains? Or what's the state of the art there today? Matthew: Yeah. Right. So, I mean, I think there's two big problems in industry that people would love to solve. One of them is comprehending medical records and the other one is comprehending the medical literature. When you state the problems, they sound the same. It's like, I want to extract the entities and map the relationships and then link them to ontologies so that I can structure the data and then make queries over it. And if you can do that, then the practical challenges are things like ""Can I show one of our clinical scientists the right piece of literature at the right time to help them make the right insight about this genetic variant that's never been observed in someone before?"" And then if you look at the medical records, it's how do I take this allegedly structured unstructured data and turn it into something that's actually structured, so that we can make trajectories of people's disease progression or predictive risk. And so it turns out that training language models on Google books and New York Times articles and Wikipedia does not actually help that much, but also surprisingly, several years ago, I did some experiments when I used to be at IBM research where I had a research group. We did some experiments where we had domain-specific corpuses and general corpuses. We would train the same models, and to my surprise, the bigger general corpus helped more than the specific corpus. And that was an early transfer learning insight. It was like, ""Take the biggest corpus you can get"" and then transfer learning is a good idea.
What's hard is, the concepts are the same to humans, but when you look into a medical record, it says MGM, BRCA and that means maternal grandmother had breast cancer. And you look in the medical literature that's published academically, it doesn't even talk about the relatives, and it doesn't even say ""breast cancer"". It says ""malignant neoplasm of the tissue"". You don't even know it's talking about the same thing. So mapping the concepts across this is tricky, and just the syntax, right? The medical abbreviations, and it's almost like it needs its own language model. So I mean those are some of the hard problems for contemporary methods to actually work on, especially out of the box. So we do take encoder-decoder based sort of Transformer models and adapt them pretty readily with supervised training. And it's definitely better than starting from scratch, but it still requires domain experts labeling stuff to get there, or it takes some weak supervision, data programming sort of methods where people write rules that make a lot of sense to weakly label the data. And it's not as good as human-labeled expert data, but you can bootstrap yourself into having a better data set to train on. So some of those methods work really well in biology. Lukas: That's really interesting. Do you have the sense that, well I don't know, I have the sense that recently NLP methods have improved a lot. When I look at scores that I'm used to from a decade or two ago, they just seem much better over the last couple of years. Has the same thing happened in the medical field? Matthew: Kind of. So if you take the scores on question-answer datasets, some model is better at answering Stanford question-answer questions than I am, right. Super, very impressive. But I don't think you would expect the same thing to be true with a medical question-answer dataset and a bunch of specialist doctors in whatever domain. So no one expects a chatbot powered by GPT-3 to be better at giving medical advice. But that said, the language model that's learned could be extremely useful for facilitating a human expert. And so I think that's where the hope is at this point, it's just like AI assistant, better information retrieval, better support for the expert is the current hope. Lukas: Got it. I guess a general problem that a lot of people ask me about, and I know a lot of people listening to this would wonder about is, how do you think about structuring your team? You talked about half the people doing core research, but then also it seems what you're doing is very connected to what the company is doing. Do you try to literally separate the people that are doing the applied stuff and research stuff? Or do you separate it by the field of work? Or how do you think about that? Matthew: Yeah, it's a really good question. And I suspect the answer that I have today will be different than the answer I have in a few years, which I know is different from the answer I had a few years ago. And it feels like one of those things that we'll keep reinventing. We'll keep reinventing how to deploy software and we'll keep reinventing how to provision infrastructure. And we'll come back to the same basic principles that people thought of a few decades ago, but we'll keep refining it. And so right now, the company is effectively, the company still works like a startup with very clear product-driven vertical teams. And the idea that we want to imbue a machine learning capability into the company, it's hard to figure that out.
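As a concrete illustration of the weak supervision and data programming idea Matthew describes above (writing rules to weakly label data before training a model), here is a minimal sketch in plain Python. The labeling functions, label names, and example snippets are all invented for illustration; a real setup would typically use a learned label model rather than a simple majority vote.

# Hypothetical labeling functions for a 'family history of breast cancer' label
# on clinical-note snippets. Each rule votes POSITIVE, NEGATIVE, or ABSTAIN.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_mentions_brca(text: str) -> int:
    return POSITIVE if 'brca' in text.lower() else ABSTAIN

def lf_family_abbreviation(text: str) -> int:
    # 'MGM BRCA'-style shorthand: maternal grandmother, breast cancer gene.
    t = text.lower()
    return POSITIVE if 'mgm' in t and 'brca' in t else ABSTAIN

def lf_negation(text: str) -> int:
    return NEGATIVE if 'no family history' in text.lower() else ABSTAIN

LABELING_FUNCTIONS = (lf_mentions_brca, lf_family_abbreviation, lf_negation)

def weak_label(text: str) -> int:
    # Combine rule votes by simple majority, ignoring abstentions.
    votes = [lf(text) for lf in LABELING_FUNCTIONS if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    positives = sum(v == POSITIVE for v in votes)
    return POSITIVE if positives >= len(votes) - positives else NEGATIVE

print(weak_label('MGM BRCA dx age 45'))           # 1 (weakly positive)
print(weak_label('No family history of cancer'))  # 0 (weakly negative)

The weakly labeled examples would then serve as noisy training data for the supervised Transformer fine-tuning he mentions.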
And it's a little different if you're a Google and well, the company is built on machine learning based information retrieval. So we expect everyone to take a machine learning approach to something. So I guess the direct answer is we have a functional team. Everyone goes to meetings together, hangs out together, checks in together, but people have different projects. And it has definitely been hard for some of the team members. A common source of feedback is ""I don't know what everyone else is doing because everyone is working on something else. I'm used to working with like four people on a specific project. And we talk every day in a standup meeting, and in this team everyone's doing something different."" And the people on the team who went to grad school and experienced what that's like to get a PhD where it's ultimately up to you to do your thing, they're more comfortable with it because they're like, ""Yeah, of course we're all doing our own thing."" In reality, I really hope everyone is not doing their own thing. I hope that there is cross-fertilization and support and it's inherently matrixed. But the goal is, we reserve some of the people's time for research because if you don't explicitly set aside the commitment, then it will be absorbed by whatever demand of the product team in the short term. And then we set aside some people's time to develop platforms that are modular and reusable with the hopes that we continue to imbue that throughout the rest of the engineering teams. And then we set aside some people who are then functionally assigned to specific engineering projects, whether it's to realize one of the research projects into production or it's to leverage one of the platforms for a problem, or maybe it's just someone has a pretty straight forward problem, and they need a scikit-learn model, and it's going to take someone an afternoon to prototype and three weeks to get it to production, so we stick someone in there for a sprint or two and make sure it happens. I guess in some sense it's a very zone defense strategy, it has to be flexible. Lukas: Right. And do you then hire people who have sort of knowledge of multiple topics, they seem such deep fields that are different. Is it possible to find someone that knows about multiple of these applications? Matthew: Yeah. So we hire people with specific expertise, for sure. I am actually just extremely fortunate. I was a software engineer who went to get a graduate degree in computational biology at a time when doing that probably also meant that you're going to do biology. And then I went and worked at IBM in a research division with just this huge diversity of industrial interests. So I was exposed to lots of different AI methods and that was not something that I knew was going to happen to me, but was really fortunate. And what that means is I met people in different industries at different conferences to understand, ""Oh, there's this boutique thing that was popular two decades ago, but continues to be a core technology for NASA or Toyota. And not a lot of people pay attention to it, but man, it can solve a lot of problems."" But you're just not going to find a Coursera course on [it]. So it's great because we can find people with that expertise. And if they're CS PhDs, kind of fortunately, right, so Computer Science PhDs are generally interested in stuff. If you practiced NLP algorithms, you're probably still interested in plain computer vision. So I think that's a fortunate thing. 
I can find someone with the expertise in information retrieval and they can still make really meaningful contributions to other types of problems and other subject domains. One of the harder things is getting the biology knowledge solid enough that they can talk to the biologist and the other stakeholders and quickly understand the problem statement. Lukas: That makes sense. And I guess, one of the things that you talk about a lot, I think is the importance of engineering to making all this stuff work. Do you hire just pure engineers on your team or do you rely on outside teams to provide that? Matthew: Yeah, I know, I think it's really important. I mean, on the research projects, it's really important to be able to prototype things because... I hope your listeners find me eloquent, but my experience in life is, I may have some beautiful complex system in my head and I have a very little ability to communicate it to other people's brains, and building a prototype really helps. And you need a diversity of skills and even a small team to make that happen. It's just a waste of everyone's potential to ask the algorithms expert to write some React front end that you're going to throw away after you show it off to a stakeholder. So better to have a JavaScript programmer around here for that- Lukas: Do you have a ratio that you shoot for? I'm just curious about this as sort of algorithms to implement or something. Matthew: I think it just depends, but we've tried to maintain a bench of depth so that we can recombine it. I think a lot of really high impact projects can be done with one algorithms person prototypes a thing, hands it off to one or two engineers who implement the thing. And then we further hand it off after it's implemented to an engineering team that's going to love and care for it in the long term and maybe come back to us if they need new features, but to them it looks software that could have come from anywhere. Other projects you need... We have some of our more challenging algorithmic problems where we'd like the approach of probabilistic programming and there's not a lot of mature frameworks out there for that, like Google and Uber AI, both popularized some, but you need some pretty heavy lifting on algorithm development, some fearless backend engineering chops to make anything happen. And then once you have the ability to make anything happen, then you also want to layer in the computational biology expertise to make sure the right modeling steps that I described before is happening. So that could be a several person team just to make the prototype, because it's complicated and the tooling requires help. And it's not as simple as a web backend and a React frontend or something. Lukas: One of the things that I've been noticing, at my company we've seen more and more interest in customers coming in from pharma and the medical stuff. And it always feels to me like, of all of our customers, it's the biggest culture clash. Basic stuff that I feel I haven't discussed in a long time, they'll be suspicious of open source software, and so I'm like, ""Oh my God, is it like 1995?"" Does that happen at Invitae? Because it's sort of a newer company and sort of more maybe CS-focused or do you also feel that working with biologists? Matthew: No, I don't think it's a problem here. And certainly I saw that problem with IBM customers at times. I was lucky working at IBM, they were huge investors in Linux, 20 years ago. 
And it was clear to everyone why that continued to be the case, but I would see it from other companies who were like, ""I would prefer the lower performance, more expensive proprietary thing, thank you."" I mean, I think one of the virtues of Invitae is it does have a Bay Area ethos, and ""Get there faster, get there cheaper"" is a good idea. Lukas: Totally. Matthew: So I don't think there's any skepticism there, but sometimes you collaborate with the insurance agencies, or the insurance payers or Medicare, and then you're into a whole ballpark of... It's not even individual skepticism. And it's not even institutional skepticism. It's codifying contracts. Yeah. So, I mean for us, it's not a problem at all. I would imagine the bigger the company, the older the company, it's probably true in every sector, but a lot of the big old companies in technology got over it a long time ago. Lukas: Right, right. Yeah. That makes sense. Well, we always end with two questions. I want to make sure we have time for them. They're broad and feel free to expound a little bit, but one thing we always ask people is, when you look at what people are seeing or what people are doing in ML, what's a topic that you think people don't pay enough attention to, maybe a skillset that you'd like to hire for but nobody's studying, or something that you'd like to spend more time on if you could? Matthew: Yeah. I mean, so just reasoning in general. And I think this happens. If you go to a general AI conference, whether it's one in recent favor, like ICML or NeurIPS, or it's the oldest, AAAI sort of standard conferences, keynote speakers will talk about fast and slow AI, or a system one and system two or whatever. But I think no one ever actually wants to do reasoning, because it's so hard. But then you see communities of self-flagellating academics lamenting that they're only competing to get a higher F1 score on some published data set that's been around forever, and what's the actual use of it all. And I think this conversation also often turns to ""Oh, if we were doing some more complex reasoning thing then it would be more valuable for mankind"", but it's just hard. So that's why I said earlier, we're into the probabilistic programming ideas, because you can take a causal graphical model that can be highly explainable and you don't have to Monte Carlo sample it until the end of time, thanks to variational inference and frameworks like Edward and Pyro that make it easier. I think that's going to push our ability to reason about really complex things and bring human expertise in and let people help correct the models and do a lot of the things that, frankly, people talk about doing but that are hard to do. I think there's also a bias against systems at academic conferences. No one wants to write a quote unquote systems paper in a workshop. They want to write an algorithms paper that's going to get cited 10,000 times. But that work is probably more important. Or like, putting together a thing that solves a problem is really valuable. And I wish we trained grad students to think about that instead of to think about hyperparameter tuning, effectively. If I could snap my fingers and change one thing about this field, I think that would be it: ""Pay attention to complicated systems, because it will help you build things like reasoning engines"". Lukas: Interesting. I guess I have not made a connection between reasoning engines and systems. Those two both seem separate tracks.
Is there something about making working systems in your experience that really requires reasoning? Matthew: So if you take an example, the word embeddings or graph embeddings, once you have that representation of similarity, you can rank documents and calculate a F1 score, but you can also give it to an expert and say, ""I found this thing for you. Do you think it's the right thing or not?"" And if they say yes, you can further process it and extract some more information out of it for a specific purpose. And if they say no, then you could ask them why not? And reason about the introducing of relationships you've extracted and actually auto refine your model from the feedback of the user, but that's a HCI problem. That's an interaction problem that you're not even going to start to touch, unless you're open to the idea of building some boring system that ties together a user interface and some backend systems that are not all machine learning. Lukas: Totally. Okay. So final question is, so when you look at, in your career taking stuff from the sort of prototype version to deploy it in the real world and useful, where do you see the sort of biggest bottlenecks or biggest problems? Matthew: I think the biggest fundamental problem is when you work in an industry and you have an existing company, but probably also when you have a startup and you're trying to get funding for it, to have buy-in from the product philosophy, from the outset. And to have some willingness that the prototypes might not work. So you need a foundational, definitely going-to-work plan to make a product. But to have a ""I'm going to reserve 20% of the resources to try this crazier thing, we'll prototype it, and if it works, it'll be great"", you gotta have the person who's going to take it to market care about that idea. When you have a bunch of researchers hanging out, making cool prototypes, and then they take it around like a toddler who made a thing, and they're like, ""Oh, look at this thing I made you, don't you love me?"" I think almost every researcher I've known in industry could identify with what I just said as the toddler, because we all think we have some brilliant idea and we make a thing and we take it to people and they're like, ""I'm sorry, I have a deadline right now. I don't understand, I already have a thing that recommends papers. I think it uses a regular expression?"" They don't care and they don't see the value. So you have to really get the buy-in at the beginning or you can spend a lot of time making a hard thing, and probably an expensive thing, happen and then it doesn't actually go anywhere. And it's more emotional than strategic. You have to be open to the idea that they might not see the value in what you want to do, and that helps you prioritize what to do. Lukas: Interesting. We've not heard that answer yet, but that really resonates. That makes a lot of sense. Thank you so much. This is a lot of fun. I really appreciate your openness. Matthew: You're welcome. Really appreciate it. Lukas: Really appreciate it. Thanks for listening to another episode of Gradient Dissent, doing these interviews are a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation that would make me inspired to do more of these episodes. 
And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",7302 +Clément Delangue — The Power of the Open Source Community,https://www.youtube.com/watch?v=SJx9Fsnr-9Q,2795,2021-06-10,"Clem: I think through the open source model, you can do things a bit differently with kind of the inspiration of open source for infrastructure and databases. With companies like Elastic, MongoDB that have shown that you can, as a startup, empower the community in a way and create a thousand times more value than you would by building a proprietary tool, right? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host Lukas Biewald. Clem Delangue is CEO and Co-Founder of Hugging Face, the maker of the Hugging Face Transformers library, which is one of the most, maybe the most exciting libraries in machine learning right now. In making this library, he's had front row seats to all the advances in NLP over the last few years, which have been truly extraordinary. And I'm super excited to learn from him about that. All right, my first question is probably a silly question, because almost anyone watching this or listening to this would know this, but what is Hugging Face? Clem: We started Hugging Face a bit more than four and a half years ago, because we've been obsessed with natural language processing, the field of machine learning that applies to text. And we've been lucky to create Hugging Face Transformers on GitHub, which became the most popular open source NLP library, that over 5,000 companies are using now to do any sort of NLP, right? Information extraction, right? If you have a text you want to extract information from. A platform like Chegg, for example, for homework, is using that to extract information from homeworks. And you can do text classification; we have companies like Monzo, for example, that are using us to do customer support email classification. They receive a customer support email: which product team does it relate to, for example, is it urgent, not urgent? To many other NLP tasks like text generation for autocomplete. Or really kind of any single NLP task that you can think of. And we've been lucky to see adoption not only from companies, but also from scientists who have been using our platform to share their models with the world, test models of other scientists. We have almost 10,000 models that have been shared, and almost 1,000 datasets that have been shared on the platform to kind of help scientists and practitioners build better NLP models, and use that in their product or in their workflows. Lukas: And so Hugging Face Transformers is the library that's super well-known, right? And then the platform is a place where you can go to use other people's models, and publish your own models. Do I have that right? Clem: Yeah, exactly. With a hybrid approach to building technology. We feel like you need kind of the extensibility of open source, and the practicality of, for example, user interfaces, right? We cover really kind of the full range, meaning that if you're a company, you can do everything yourself from our open source, not talk to us, not even go to huggingface.co, do everything from pip install transformers, right? If you want a bit more help, you can use our hub to discover a new model, find a model that works for you, understand these models.
To even in a more extreme way, if you're a software engineer, or if you're new to NLP, or even new to machine learning, you can use our training and inference APIs to train and run models. And we're going to host this inference and this training for you to make it very, very simple so that you don't have to become an NLP expert, to take advantage of the latest state of the art NLP models. Lukas: That's so cool. I mean, I want to zoom in on Hugging Face Transformers first, because it feels like it might be one of the most popular machine learning libraries of all time. I'm kind of curious what you attribute to that success. When did you start it and what were you thinking, and what did you learn along the way? Clem: I mean, it may be, I don't know if it's the biggest machine learning open source. It's definitely the fastest growing, because it's fairly new. We released the first version of it two and a half years ago, which is not a long time ago in the grand scheme of open source, right? Lukas: Yeah, for sure. Clem: If you look at all the kind of most popular open source, you see that they usually need a very long time of maturation, right? The grand scheme of open source Transformers is very much still a baby, but it grew really, really fast. It really blew up with over 42,000 GitHub stars, over a million pip installs a month. I think we have 800 contributors to Transformers. And the main reason why I think it's successful is, to me because it really bridges the gap between science and production, which is something fairly new and that not a lot of open source and not a lot of companies manage to do. I strongly believe that machine learning compared to, you can call it software engineering 1.0, or software engineering, or computer science, even if computer science as science in the name of it, it's not a science-driven topic, right? If you look at a good software engineers, they don't really read research papers, they don't really follow the science of computer science. Machine learning is very different, it's a science-driven domain, right? It all starts from couple of dozen kick-ass kind of NLP science teams all over the world that are creating new models like, BERT, T5, RoBERTa, all these new models that you've heard from. And I think what we managed to do with transformers is to give these researchers the tool that they like to share their models, to test models of others, to go deep into kind of the internals of the architecture of these models. But at the same time create an easy enough abstraction, so that any NLP practitioner can literally use these models just a few hours after it has been released by the researchers, right? There's some stuff like a magic, some sort of like network effect, or some sort of magic when you bridge the two. We don't understand all the mechanics of it yet, but there's some sort of a network effects for it each time there's a new model released, like the researcher is releasing it within Transformers. People are hearing about it, they're talking about it, they want to use it, they test it in Transformers, they put it in production, it works. So they want to support it more. The scientist is happy that his research is seen, is used, is impactful. And so they want to create more and they want to share more. This kind of like virtuous cycle that I think allowed us to grow much faster than traditional open source. And that kind of struck a chord on the market and on the field of machine learning. 
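For a sense of what that easy-enough abstraction looks like in practice, here is a minimal sketch using the transformers pipeline API for two of the tasks Clem mentions. It assumes transformers and a backend framework such as PyTorch are installed; the example texts are invented, and each pipeline downloads the library's default checkpoint from the Hugging Face hub.

from transformers import pipeline

# Text classification, e.g. triaging a customer-support email by sentiment.
classifier = pipeline('sentiment-analysis')
print(classifier('My card payment failed twice and I need this fixed urgently.'))

# Extractive question answering, pulling an answer span out of a passage.
qa = pipeline('question-answering')
print(qa(question='Which team handles payment failures?',
         context='Payment failures are routed to the billing team within one business day.'))

Swapping in a different checkpoint from the hub is just a matter of passing a model name to pipeline().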
Lukas: I guess as an entrepreneur, I'm always kind of fascinated by how these virtuous cycles get started. When you go back two and a half years ago, when you're just first starting the Transformers project, what was the problem you were trying to solve, and what inspired you to even make an open source library like this? Clem: I could probably give you a kind of like a smart thoughtful- Lukas: No, no, I want the real answer, tell me what's actually happening. Clem: The real truth is that we didn't think much about it. We've been using open source for a while. We've always felt like in this field, you're always standing on the shoulders of giants of other people on the fields before. We've been used to this culture of when you do science, you publish a research paper for research in machine learning, you even want to publish open source versus in the paper, right? And so since day one at Hugging Face, we've always done a lot of things in the open, sharing in open source. And here for Transformers, it started really simply, with BERT that was released in TensorFlow. And Thomas, our co-founder and chief scientist was like, ""Oh, it's in TensorFlow, we need it in PyTorch, right? I think two days after BERT was released, we open-sourced PyTorch BERT. That was literally first name of the repository. And it grew up, people started using it like crazy. And then a few weeks after, I don't remember what model was released. I want to say RoBERTa, but no, RoBERTa was much later. But another model was released maybe with GPT actually, I think it was the first GPT. It was released, and I think same thing, it was really just in TensorFlow, and we were like, ""Okay, let's add it."" And we felt like, ""All right, let's make it so that it's easier for people to try both, because they have different capabilities, good at different things."" We started thinking about what kind of abstraction we should build to make it easier, and very much like that, it went organically, and at some point researchers were like, ""I'm going to release a new model, can I release it within Transformers?"" And we'll say, ""Okay, yeah, just do that."" And they did that, and then kind of like a snowball, it became bigger and bigger, and brought us to where we are now. Lukas: That's a really cool story. I didn't realize that you were trying to put models from TensorFlow to PyTorch. I mean, now you work with both TensorFlow and PyTorch, right? Clem: Yeah. Lukas: Did you feel at the time, I guess, a preference for PyTorch, or why was it important to you two and half years ago to move something to PyTorch? Clem: I think the user base was different, right? We've always been passionate about democratization or making something a bit obscure, a bit niche, making it available to more people. We feel like that's how you get the real power of technology, is when you take something that is in the hands of just a few happy few, and you make it available for more people. That was mainly our goal, there are like 10 people who are using TensorFlow, there are people who are using PyTorch. We wanted to make it available to people using PyTorch. We were using PyTorch ourselves extensively. We think it's like an amazing framework, so we were happy to make it more available. The funny thing is that, as we got more and more popular at some point, we've seen the other movement in the sense that people were saying... 
At some point we were actually named PyTorch Transformers, and we started having a lot of people working in TensorFlow who were like, ""Guys, it's so unfair, why can't I just use Transformers if I'm using TensorFlow?"" And so that's when we extended to TensorFlow, and dropped PyTorch Transformers, dropped the PyTorch in the name, and became Transformers to support both. It's been super interesting, because if you look at our integration of PyTorch and TensorFlow, it's more comprehensive, it's more complete than just having half of it that is PyTorch and half of it that is TensorFlow. You can actually, within the same machine learning workflow, do part of it in PyTorch. For example, when you want to do more like the architecture side of it, PyTorch is really, really strong, but when you want to do kind of serving, TensorFlow is integrated with a lot of tools that are heavily used in the industry. In the same workflow, you can start building your model in PyTorch, and then use it in TensorFlow within the library. Which we think is pretty cool, because it allows you to take advantage a little bit of the strengths and weaknesses of both frameworks. Lukas: Do you get a chance to use your own software anymore, do you build Hugging Face applications ever at this point, or you're just making these kind of tools for other people? Clem: Yeah, we play with them a lot. I think one of our most popular demos ever was something called Write With Transformers, which was some sort of text editor powered by some of the popular models in Transformers. I think something like the equivalent of over 1,000 books have been written with it. It's a bit like the auto-complete you have in Gmail, except much more silly and creative. It works really well when you have kind of the syndrome of the... Can you say that in English? Syndrome of the white page, when you don't know what to write. Lukas: Oh, yeah. I don't think we say it like that, but I understand the experience. Clem: In French we say ""syndrome de la feuille blanche"", when you want to write but you don't know what to write about. It helps you be more creative by suggesting long, interesting texts. Lukas: That's really cool. I wanted to ask you, I feel like you have a really interesting lens on all the different architectures for NLP. I guess, are you able to know kind of what the most popular architectures are? Have you seen that change over the last two and a half years? Clem: Yeah, we do. We can see the download volumes of models. It's interesting to see, especially when new models are coming up, whether they're successful or not, how many people are using them... Something that's been super interesting to us is that actually the number one downloaded model on the hub is DistilBERT, right? A model that we distilled from BERT. But there's also a lot of variety in terms of usage of models. Especially I felt like over the years they became in a way a bit more specialized, right? Even if they're still kind of general pre-trained language models. I feel like more and more, each new model came with some sort of an optimization that made it perform better, whether on short or longer texts, on generation tasks versus classification tasks, multi-language versus mono-language. You start to see more and more diversity based on what people want to do with it. And what kind of strengths and weaknesses they value the most, right?
A little bit like what I was talking about between PyTorch and TensorFlow. People are trying to not so much decide which model is the best, which is kind of silly in my opinion, but which model is the best for which task, for which context, and then pick the right tool for the task. Lukas: I guess, for someone listening to this who doesn't have an NLP background, could you explain what BERT is, and just what it does, and maybe how DistilBERT differs from it? Clem: The whole kind of revolution in NLP started with a seminal paper called ""Attention Is All You Need"", right? Which introduced this new architecture for NLP models based on transfer learning. BERT was the first and kind of most popular of this new generation of models. And the way they work, in a simplistic way without getting too technical, is that you pre-train a model on a lot of text on one specific task. For BERT, for example, it's mask filling: you give it sentences, you remove a word in the middle of the sentence, for example, and then you train the model on predicting this missing word, right? And then you do that on a very large corpus of text, usually a slice of the web, right? And then you get a pre-trained model that has some kind of understanding of text that you can then fine-tune. Hence the name transfer learning, because you can go from one kind of pre-training task to other fine-tuning tasks. You can fine-tune this model, for example, on classification, right? By giving it a couple of thousand examples of a text and its classification, like the customer support emails that I was talking about, ""urgent"" or ""not urgent"", right? And after that, the model is surprisingly good at classifying a new text that you give it based on urgency. And it's going to tell you, for this message there's a 90% chance it's urgent, based on what it's learned in the pre-training and in the fine-tuning. Lukas: For example, with BERT, I guess, you have a model that can fill in missing words. How do you actually turn that into a model that, let's say, classifies customer support messages? Clem: With fine-tuning. You fine-tune by adding a layer; you fine-tune this model to perform on your specific task. And more long term, I think that's a very interesting way of doing machine learning, because intuitively you almost feel like it's the right way to do machine learning, in the sense that what we've seen in the past with machine learning, and especially for startups, a lot of them have kind of sold this dream of doing machine learning, and doing some sort of data network effects on machine learning, right? Because there's this assumption that you're going to give more data to the model, and it's going to perform better. And I think that's true, but the challenge has always been that you have more data, and so your model performs incrementally better, but only on what you're able to do already, right? If you're doing time series prediction, maybe you have 1 billion data points, right? And your model performs at 90% accuracy, you add maybe 9 billion, 10 billion, additional data points, and your model is going to perform at 90.5% accuracy, right? That's great. I mean, that's a good improvement, that's something you need, but it doesn't give the kind of increased performance that you're really expecting from a typical network effect, in the sense that it doesn't make your result 100X, 10X, 100X better than without it.
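A rough sketch of the fine-tuning step Clem describes above, using the Transformers library: a pre-trained BERT encoder gets a small classification layer on top and is trained on labeled pairs such as urgent versus not-urgent support emails. The example texts and labels are made up, and the optimizer loop is omitted.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Adds a randomly initialized classification head on top of the pre-trained encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)  # 0 = not urgent, 1 = urgent (hypothetical labels)

texts = ['My card was charged twice, please refund me now.',
         'Just wanted to say thanks for the great service!']
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one gradient step of fine-tuning; optimizer and loop omitted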
With transfer learning, it's a bit different because you're not only kind of improving incrementally the accuracy on one task, you give it the ability to solve other tasks as well. So you actually not only increase the accuracy, but you increase the capabilities of what your model is able to do. I won't go into kind of the crazy Musk-type predictions. But if you take actually the Elon Musk, kind of OpenAI, founding story, where he's saying like, ""We need to bring the whole community together to contribute to something open source for everyone"", intuitively you could think that could come with actually transfer learning, in the sense that you could envision a world where every single company is contributing with their datasets, with their compute, with their weights, the machine learning model weights, to build these giant kind of open source models that would be able to do 100X more things than what each of these companies could do alone. I don't know if we're going to get there in the foreseeable future, but I feel like, in terms of concepts, that's something interesting to look at when you think about transfer learning, as opposed to the other techniques of machine learning. Lukas: I guess, did you have a feeling about OpenAI, not releasing the weights for the early GPT models? Or I guess, any of the GPT models. Clem: Yeah. GPT, GPT-2, I think a couple of versions in between were open source, right? And it's in Transformers, and we have a lot of companies using them. Probably more companies using GPT-2 through Transformers than GPT-3 today. They're a private company, so I totally respect their strategy not to open source the models that they built. They've done an amazing job with GPT-3, it's a great model for everything when you want to do text generation, it's really useful. I'm really thankful for all the work they've done democratizing the capabilities of NLP. As our goal is to democratize NLP, I feel like they've done a lot promoting it to more of the startup community in a way. A lot of people realized too, with that communication, that you could do so much more than what we've been doing so far with NLP, which is great. I think it contributed to the development of the ecosystem and put NLP in the spotlight, which has been really great. And we see a lot of companies starting to use GPT-3, and then obviously it's expensive, it's not really extensible. You can't really update it for your own use case. It's hard to build some sort of technological competitive advantage when you build on top of a proprietary API, or an API from someone else. We see a lot of companies using GPT-3, and then discovering NLP, and then coming to our tools. And the same way happens, I'm sure, the other way around. Some people start with our tools, our open source, and then they decide to kind of use something a bit more off the shelf like GPT-3, or Google NLP services, or AWS Comprehend. These companies have been providing APIs for NLP for a while too. I think everyone is part of the same ecosystem that is growing, so that's super exciting. Lukas: Do you feel like there's a difference in the GPT approach versus the BERT approach that you were talking about? I mean, GPT has been very high-profile, and the text generation is really impressive. Do you feel like OpenAI is doing something kind of fundamentally different there? Clem: Yes. They are both Transformer models, right? They're kind of the same technique, with slightly different architectures, right?
For example, where BERT is doing mask filling, GPT is doing language modeling. So, next word prediction, so it's a bit different, that's why the text generation capabilities are so much stronger. It has its limitations too, for example, if you want to do classification, you shouldn't do it with GPT, it doesn't make sense at all. They solve different use cases with kind of slight variations of the architecture. We've had people reproducing GPT. I mean, we've had GPT-2, and a team called Eleuther, I don't even know how to pronounce it, released GPT-Neo a few days ago, which has the same architecture as GPT-3, just with fewer weights for the moment, but they intend to kind of grow the weights. I think the size of their model is the equivalent of the smaller GPT-3 that OpenAI is providing through an API today. And it works well, it's interesting to see the power of the open source community. I think one of my fundamental convictions is that in a field like NLP or machine learning in general, the worst position to be in is to compete with the whole science and open source fields. Lukas: Sure. Clem: Just because I've been in this position before, actually the first startup I worked for, we were doing machine learning for computer vision back in Paris. I'm French, obviously, as you can hear from my accent. But competing against the science fields and the open source fields on such a fast moving topic is a difficult position to be in, because I think you have 100s of research labs at larger organizations, or at universities, that are not necessarily, each one, better than what you can do at the startup, but there are just so many of them that when you can do just one iteration, you have 100 out there doing one iteration too. You can outpace them, and be the state-of-the-art for a few days, then someone who started just a few days after you is catching up, and then you're not kind of ahead anymore. We've taken a very different approach: instead of trying to compete, I think, with open source and with the science field, we're trying more to empower it in a way. And I think through the open source model, you can do things a bit differently with kind of the inspiration of open source for infrastructure and database, with companies like Elastic, MongoDB, that have shown that you can, as a startup, empower the community in a way, and create a thousand times more value than you would by building a proprietary tool, right? And that you don't have to capture 100% of the value that you create, right? That you can be okay creating immense value and just capturing 1% of it to monetize to make your company sustainable. And you can still kind of make a large public company, like in the case of MongoDB. Both have kind of this open source core, but at the same time can grow an organization and be sustainable. And I don't see why it should be different for machine learning. We haven't seen a lot of large open source machine learning companies yet. For me it's more a matter of how early the technology is. It's too early to have large open source machine learning companies, because I mean, five years ago, nobody was using machine learning, but it's going to come. I think I wouldn't be surprised if in five, ten years, you'd have kind of one, two, three, four, five, ten massive open source machine learning companies. Lukas: I guess, you've had really front row seats to the cutting edge of NLP over the last couple of years.
Do you feel like the applications have changed with these models getting more powerful and useful? Are there things you see people doing now that you wouldn't have seen people doing three years ago? Clem: Yeah, honestly, I think out of the 5,000 companies that are using Transformers, I mean, the vast majority, I mean, it's hard to tell, but we see a lot of them that are using Transformers in production. And I would say that most of them weren't using NLP in production five years ago, right? A lot of these are new use cases, that either were impossible before, so the companies were just not doing it, or really were performed by humans, right? Moderation, for example, is a good example of that. Customer support classification as I was saying, it's replacing kind of a very manual process. Auto-complete is really, really big in Gmail. It's been my biggest productivity enhancement, I feel like in the past few months is using Gmail to complete basically write just half of my emails. Now, most of the search engines are mostly powered by NLP, by Transformer models. I know Google now is saying that most of their queries are powered by Transformers. Arguably it's like the most popular consumer product out there. I think it's changing so many products, the way products are built. I'm really [interested]...and that's why also seeing GPT-3 kind of promoting NLP into the startup world is super interesting. I think it's very game changer when you have companies starting, building products from scratch, leveraging NLP. Because I think you build differently, right? When you start kind of building legal...you can think of basically every company today. It's really fun to think, ""What if these companies started today with today's NLP capabilities?"" And you'll see that you have so many ideas for them to do things differently. You take like DocuSign, right? What if DocuSign with kind of analysis of documents starting today with NLP. You think Twitter- Lukas: Wait, wait, tell me about DocuSign. Because what I do with DocuSign is I get like a message, and then I click sign, and then I sign the thing. What would be different about DocuSign if it started with all the technology available today? Clem: I don't know. It would give you so much analysis of the... There would be a ""too long; didn't read"". Lukas: For the contract? Clem: Yes, for the contract. Instead of having to read five different pages, five-page long documents, you would have an automatically generated summary of the document- Lukas: I see, I see. Clem: -with highlights in green or reds. The interesting part in the documents, like when you see oh, there's a big kind of like money shot, that's why they define how much money you're going to make. Lukas: Yeah, right. Clem: Big green flashing lights, be careful about... Or when there's a small star that says ""Everything that we wrote before is completely...not...it doesn't work in that case"", the small kind of conditions would put big red flashing light, ""Be careful, they're trying to screw you here."" Lukas: I love it. Clem: Things like that- Lukas: That was so fun. Tell me about if Twitter started with this technology available. Clem: What could Twitter do? First it would do the feed completely different, right? It would not show you tweets because they're popular or tweets because they're, I mean, not popular I would say, controversial. But it would show you tweets that you would relate to, tweets that you would be interested in based on what tweets you tweeted before. 
Hopefully it will be able to moderate things, it would be better at avoiding biases, avoiding kind of violence, inappropriate content, racism, and bad behaviors. What else could it be? I would have wanted obviously an edit button, but I don't know if NLP would help with that. Lukas: A what button? Clem: No. This famous thing that for ages everyone asked for, everyone has been asking for an edit button on- Lukas: Oh, edit button, no, yeah, right, right. Clem: But it wouldn't be NLP-powered. Let's say if I just started today, I would add that. What else? Do you have any idea of what they would do differently with NLP today? Lukas: Well, honestly, I don't know how you feel about this, but when I look at the text generation technology, the NLP technology, and that was the field I actually started in, 15 years ago and more. And I almost feel like the thing that's intriguing is the lack of applications, for how amazing the technology seems to me. I remember the Turing test was this thing of, if you could converse with the... I forgot exactly the framing, but it's like converse with a computer for 10 minutes, and you can't tell if it's a human, maybe we have like AGI at that point. That seemed so impossible, and now it seems like we'll pass it sometime soon. I mean, there's variants of it, but I feel more and more like it's probable computers could trick me into thinking that I'm talking to a person with just GPT-3 or another text generation model. But I actually feel like I don't engage with totally new NLP applications yet. And I kind of wonder why that is. Clem: I mean, I wouldn't agree with you. I think that usage of it is really everywhere right now. I mean, there are not a lot of products that don't start to use some NLP, right? Lukas: Maybe it's just more subtle than I would- Clem: Yeah, maybe. It's less in-your-face in the sense that it hasn't been these big kind of conversational AI interfaces that took over in a way, right? For a very long time, that was kind of the most popular, and kind of mainstream, face in a way of NLP, right? People think of NLP as Siri, Alexa, in a way. And it's true that we haven't seen that picking up, right? Chatbots haven't proved to be very good yet, and we're not there yet in the capabilities in really kind of solving real problems. But I think it became adopted in a way more sober way, in a way more kind of incremental way compared to its existing use cases. You're probably using Google every day, and it's true that maybe you don't see much of the difference between the search results before and now. But the reality is that it's the most mainstream, most used product of all time, that most people use every day, and it's powered by modern NLP, it's powered by Transformers. But it's not as kind of...maybe the word is ""groundbreaking"" in terms of experience changes as you could have expected, right? I think one of the challenges of NLP is that because language has been so much of a human topic for so long in a way, it carries all this kind of association with AI, right? And kind of AGI, and kind of almost this machine intelligence. And obviously if you look at all the sci-fi with ""Her"", you associate that a little bit with NLP, and that's kind of what you could have expected from NLP. The reality has been more kind of productivity improvements behind the scenes that you don't really feel or see that much as a user, it's true. Lukas: Are you optimistic about chat interfaces? Clem: I am.
I think what most of us got wrong...I mean, we started by building an AI friend, or a fun conversational AI with Hugging Face. When we started Hugging Face, they were saying we were obsessed with NLP, and we were like, ""Okay, what's the most challenging problem today?"" Open domain, conversational AI, building this kind of AI that can chat about everything, about the last sports game, about your last kind of relationship, and really talk about everything. We were like, ""That's the most difficult thing, we're going to do that."" And it didn't work out. I think what we got wrong and what most people are getting wrong is probably the timing in a way. In the sense that conversation, and especially open domain conversation, the way we're doing it now is extremely hard. It's almost kind of like the ultimate NLP task, because you need to be able to do so many NLP tasks together at the same time, ranking them. I need to be able, when you're talking to me, to extract information, to understand, classify your intents, classify the meaning of your sentence, understand the emotion of it, right? If your tone is changing, then it means different things. I think we're going to get to better conversational AI ultimately, I don't know if it's in five years, if it's in 10 years, if it's longer. But I think we're going to get there. It's already solving some kind of more vertical problems with sometimes customer support chat-bots. I think Rasa in the open source community is doing a really great job with that. I think we won't get tomorrow to the AI who you can chat with about everything, and kind of what we started Hugging Face with. But ultimately I think we'll get there, and that's when in terms of user experience, you're going to realize it's different at that time, but it's probably going to take much more time than what we are expecting. Lukas: Cool. Well, we always end with two questions, I'd love to get those in the last couple of minutes we have. We always ask, what's an underrated topic in machine learning? Or maybe in your case, what's an underrated topic in NLP, something that you might work on if you didn't have a day job? Clem: That's a good question, I mean, something that I've been super excited about in the past few weeks is the field of speech. Speech to text, text to speech, because I feel like it's been a little bit like NLP a few years ago, it's been kind of relegated as some sort of little bit boring field with not so many people working in it. And I feel like thanks to a couple of research teams, especially the team of Alexis Conneau at FAIR with wav2vec, you're starting to see new advances actually leveraging Transformer models, that are bringing kind of new capabilities. I'm pretty excited about it, I think there's going to be some sort of a resurgence of it and kind of leapfrog in terms of quality. Not only in English, but what's interesting is that it's also in other languages. We hosted a few weeks ago community sprints at Hugging Face, with over 300 participants who contributed models, speech to text, for almost a 100 low resource languages. And so it's been pretty cool to see the response of the community. I think there's going to be more things happening in the coming months in speech, which is going to unlock new use cases. Because if you think that you can combine speech with NLP, you can start to do really cool stuff. We were talking about what if like the product is built today, if Zoom was built today with good speech to text and NLP, you can do pretty cool stuff too. 
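A minimal sketch, assuming a recent version of the transformers library, of the speech-plus-NLP combination Clem is pointing at above: transcribe audio with a wav2vec 2.0 checkpoint, then feed the transcript to another NLP pipeline. The audio filename is hypothetical, and the checkpoint is just one publicly available option.

from transformers import pipeline

# Speech to text with a pre-trained wav2vec 2.0 model.
asr = pipeline('automatic-speech-recognition',
               model='facebook/wav2vec2-base-960h')
text = asr('meeting_recording.wav')['text']  # transcribe a (hypothetical) clip

# Then hand the transcript to a downstream NLP model, e.g. summarization.
summarizer = pipeline('summarization')
print(summarizer(text, max_length=60)[0]['summary_text'])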
When I'm saying something cheery, there should be automatic clapping, because otherwise everyone is kind of [muted]. That's the problem with the current Zoom: with everyone muted, when I say something to cheer, I'm the only one cheering. Or when you say ""Hoorah!"", there should be kind of emoji showers, celebratory emojis, or things like that. I'm excited for speech. If you haven't checked the field lately, you should definitely check it, there are cool things happening. Lukas: Very cool. And the final question, and I feel like you're in a unique place to see this, is what's the hardest part, or what are some unexpected challenges, in just getting a model from kind of thinking about it, to deploying it into production. And I guess you have a unique point of view here where you actually, you have a platform that makes it super easy. Are there still challenges when folks use your stuff? Is there more to do or does it work out of the box? Clem: There are still a lot of human challenges to it, I think, in the sense that a machine learning model does things in a different way than traditional software engineering. And for a lot of companies it's really, really hard to make the transition. For example, the lack of explainability, the fact that it's harder to predict the outcomes of these models, and kind of tweak them in a way. It's still really hard to understand and adopt for people who have spent a career in software engineering, where you can really define the outcome that you want to get. I think from what I'm seeing, a lot of the time the human part, the understanding of machine learning, is the most difficult thing, more than the technical aspects of it. On the technical part, I mean, we've been excited to bring on larger and larger models, which are still difficult to run in production. We've been working a lot with the cloud providers, we announced a strategic partnership with AWS not so long ago, but we're still working heavily with Google Cloud, Azure, and other cloud providers. But bringing these large language models into production, especially at scale, requires a little bit of skill, and requires some work. You can get there. I think Coinbase [Ed: Clem meant ""Roblox""] has a good article, and a good blog post on how they use one of our models, I think it was DistilBERT from Transformers, for over a billion inferences an hour, I think, if I'm not mistaken. But it's still a challenge and still requires a lot of infrastructure work. Lukas: Awesome. Well, thanks for your time. It was a real pleasure to talk to you. Clem: Thanks Lukas. Lukas: Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to these episodes. If you wouldn't mind leaving a comment, and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",6727 +Wojciech Zaremba — What Could Make AI Conscious?,https://www.youtube.com/watch?v=429QC4Yl-mA,2667,2021-06-03,"Wojciech: We almost think about it as the safety of [an] airplane. There will be multiple layers into it. You could imagine that one layer might have to do with appropriate data filtering, or maybe then another layer has to do with injecting human feedback. And maybe some like a final, I'd say, destination by the model at the very end.
Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host Lukas Biewald. Today we're talking to Wojciech Zaremba who's one of the co-founders of OpenAI and he's worked on the robotics team through most of his time there, where he made the hand that manipulated and solved a Rubik's cube. And at Weights & Biases we have been working with him for quite a long time and rooting for his team. So I'm super excited to get tactical on robotics, but also Wojciech loves to think deeply about the bigger picture in AI and so we'll get into that too. The first question I would ask you about was what it was like starting OpenAI. Wojciech: The first time I heard the idea when I met with Greg in New York, and actually even when I was about to meet the first time I overslept and we were about to meet at 5:00 PM. And then I have this like a weird working schedule that I used to do research over the night. And I was going to sleep like at 6:00 AM, 7:00 AM. So I overslept for our meeting at 5:00 PM, but eventually we met. I would say early on, there was some discussion about the mission of the company. It's also interesting that back then in the community whenever someone spoke about safety, they were considered pretty much crazy. I mean, people are saying, ""Oh, AI is so far away that it actually makes no sense to speak about it."" There were even these quotes saying that it's like thinking about overpopulation on Mars, and at some point that might be a problem, but we shouldn't be concerned about it today. Yeah. So I was also very excited like I told Greg that one of the most important people that we have to have is Ilya. And we've got Ilya. Then there was a meeting around November 2015 in Napa. I met there with Sam, I met there with Greg, Ilya, John Schulman. There was also Andrej who is now at Tesla. And of course we discussed AGI. What are the steps? What do we think is missing? It was also, I could see that these folks are thinking about, like even during spare time, about big fundamental questions. So there was this time that we're sitting at the table and Sam Altman asked everyone what they think what's the solution to Fermi Paradox, why we don't observe aliens? And the people had very sophisticated opinion about this topic. And I was thinking, ""Oh, that's the group of people with whom I would like to work. They consider this even like a metaphysical questions."" Because it's almost like the questions about AGI are almost like a metaphysical. Lukas: That makes sense. But I guess what's intriguing about the way OpenAI operates is at its core, its mission is focused on AI safety I believe. But it seems like the remarkable results coming out of OpenAI, including the stuff that you work on is it seems less about safety and more about showing the power of AI or moving the field forward. Is that right or what's the thinking there? Wojciech: So we have plenty of teams working on safety and one of the efforts working on safety was: ""Let's try to foresight what it takes to build AGI."" So people started to look from perspective of resources, how the things actually scale and connect. And then they realized that we are able to make our models significantly better with appropriate scale-up. And in some sense, that was actually a result of safety work. Another example maybe like that is there has been work on human feedback. So the idea is ""How could we inject human values into the model? 
How could we tell the model what is good versus what is bad?"" Then of course at first people started to work in the academic, most simplified domain. So the question is ""How can we tell the model that a given summary of a text is good versus bad?"" It turns out that actually these developments led to capabilities, but the motivation was from a safety perspective. And our stance regarding safety is we would wish to maximally release, let's say, capabilities and descriptions of how to build safe systems. But it's also very likely that safety and capabilities may be the same thing. In some sense, safety means ""What to do to make sure that we can control the model"". So there's various levels of safety. One level of safety is ""What to do to make sure that we can control the model"". And this is also very similar from the perspective of commercialization or capabilities; that's also what you want to happen. You don't want the models that go nuts when you are asking some slightly out of distribution question. So when you think from the perspective of our mission, the mission is to serve humanity and there are actually three different axes along which you can distribute what we have developed. One axis is you can literally just give people money, something like universal basic income. I mean, that still requires actually making a lot of money to make any difference to people. The second one is you can give away technology. So we are, let's say, building technology and we are actually sharing it maximally. And the third one is governance. So the question is ""How to make sure that humanity as a whole can decide on what to do with this technology?"" And OpenAI actually is interested in each of these axes, and they're at various stages. Lukas: I'm kind of curious, do you fear AGI? You talked about Fermi's Paradox and it seems like one reason that we don't see aliens might be that they developed AGI or some technology and killed themselves inevitably, right? That could be one reason that we don't see them. Do you worry about that? Do you put a percent probability on that? Is that something you imagine might happen in your lifetime? Wojciech: Yes, I think it's possible, but I think that there will be actually various stages of AI development. The first stage is when AI will become very valuable commercially, and I believe that might be a multi-trillion industry. Then the second stage is actually AI might become a national security threat. So you could imagine that AI could be used to control a farm of bots or manipulate the elderly or sway public opinion for some election or so. In some sense, you can say that it's already happening in some format, that there is selectively displayed content online that actually biases people in various ways. Lukas: Yeah. Wojciech: The first stage is essentially that the value of technology just keeps on increasing, the second stage is national security, and then the third stage is existential risk to humanity. It's almost the question [of] how they are spreading in time and so on. And usually we should just be worried [about] the initial parts of the sequence, and we should bear in mind all the pieces. We shouldn't just focus on the last one. Lukas: I see. So we should focus on all three of those risks then? Wojciech: Correct. Or like, the first one, the risk is not an increase of commercial value, I guess maybe the risk might be job displacement. Lukas: Right. I mean, do you have a sense for yourself of a probability that you put on existential risk?
Wojciech: It's actually hard for me to think here in terms of probabilities. I could tell you some convincing stories, and also, I noticed that these probabilities really change over time depending on some external factors and so on. Lukas: What external factors change your probabilities? Because we're not really getting new information, right? Wojciech: Yeah. So I'm saying external factors like political climate or so. Lukas: Ah, I see. Wojciech: Let's see. Let me tell you the gloom story and then I can tell you, let's say, the positive story. Lukas: Okay, great. Let's start with the gloom and then do the positive one. Wojciech: In principle you can say that it's almost inevitable that we'll build superhuman AI. It's just a matter of time. Then it's also very likely that we'll end up actually with multiple organizations building it. Because it's so valuable and there will be a competition. There might be some organization ahead, but it's very likely that we'll end up with multiple organizations. Then various people will be tinkering with the code of AI and AI will be tinkering with its own code. And it will have powerful capabilities to achieve various goals. Initially, these would be goals given by a human. But then [we] can notice that at least in the case of natural organisms, which also are derived from code, that's DNA code, there is this property that if you slightly mess up the code, the organism might misbehave. It actually might work against the host. So in the case of cells, it's actually possible to get cancer. And cancer is a prevalent phenomenon in nature. So then you could imagine now in the case of AIs, maybe if you have a couple of AIs, then we actually know what they are optimizing for and who they serve. But once there is an increased number of AIs, in some sense there's a process of mutation, which is AIs are modifying their own code, humans are modifying their code. Then there is a process of natural selection. And you can say that the AI that literally wants to maximally spread will be the one that will exist. The things in the universe that want to replicate are the things that exist. Here the main difference is that AI will have just huge power, therefore it's kind of risky. What are the consequences of AI wanting really just to optimize for replication? So I guess that's maybe a gloom scenario. Lukas: One question I always have about the gloom scenario, I mean it makes sense to me but I feel like the metaphor of natural selection...well, at least with plants and animals we reproduce, right? So like you can't change the whole system at once, but it seems like AI might have a more complicated system of changing and reproduction. Like you could imagine all the AIs changing at once or communicating. It seems like you might not necessarily...you could imagine a stable equilibrium, right? Where things aren't allowed to consume other resources for example, right? Or is there, am I missing something? Wojciech: It's possible that we'll have thousands of benign AIs and it might not be that simple even to get all the resources, but you could imagine that it randomly happens. So that one of the AIs won't be that benign. And it happened because people are modifying code, plus it started optimizing a different reward function, and it still has immense skills and then it can pursue its goal. Then it might be like I said, other AIs are defending the system or maybe they were never trained for defending.
It's very hard to predict the dynamics in a multi-agent setup. With one AI you can maybe predict what the possibilities are. It would be still extremely hard, but once you have many of them competing in some sense for resources, it's very hard to say actually what the consequences might be. Lukas: Okay. So tell me the positive story. Wojciech: You can say that even if AI would become so powerful, it wouldn't even care that much to be here. It'll just go to the stars. It would build all sorts of technology for us. It's like the same way as we are not competing with crystals. Crystals are also replicating. It's like a self-replicating machinery. It's kind of at a different level of abstraction, and it doesn't bother us that they are replicating. Of course, there's all the advancements that could happen. You could imagine that AI would cure all the diseases, remove suffering, allow us to go to the stars, and so on. Lukas: It's interesting though, both those scenarios involve steadily consuming resources and expanding. It's just that in one, the AI leaves us alone, and in the other it doesn't care, or maybe it consumes our planet. But in both cases, wouldn't you think that we would see evidence of this in some other alien life that created an AI and came to us in some self-replicating way? What do you think about that? Wojciech: You're asking the question about the Fermi Paradox. Lukas: Yes. Sorry. You brought it up, so it's top of mind. Isn't there a collapse scenario, I guess. Wojciech: Let's say you said, ""Oh, if aliens would build AGI and then AGI destroyed them, but then we would see some traces of AGI in the universe. Like the AGI would consume a lot of resources, assuming that actually...."" So there's a few assumptions. There's an assumption that once you are sufficiently technologically advanced, then you're spreading in every direction in the universe at the speed of light. And we haven't observed, in any part of the universe, anything like that. We haven't seen any Dyson spheres or so. One simple explanation might be that actually we are alone in the universe. Maybe it's so unlikely for life to flourish that we are alone. So that almost puts maybe more responsibility on us, but who knows. Lukas: Is that what you believe? Wojciech: I have a probability distribution over beliefs about what might be the case. Lukas: Tell me...you can't reveal your distribution? Wojciech: So let's see. I can tell you a fun one that I heard recently. Let's say you have a super advanced civilization; then of course, it makes sense to turn the entire planet into a computer and to maximally use matter for the purpose of computation. One thing that is actually interesting is that apparently once the universe is cooler, it is possible to do more efficient computation. So one statement is that maybe aliens are just waiting for the universe to be cooler. But I'm not sure if I believe in this; it might just be a cool description. Lukas: So I guess, how do these beliefs inform the work that you do? Like you talked about two kind of bad AI scenarios that both actually seem very relevant to me. I feel like the inequality feels real to me right now at this moment, and the political stuff also feels like it's starting to become real. And then the existential threat feels like you're telling me a very compelling story, but somehow it doesn't carry the same visceral fear for me and my child. But maybe that's irrational. How do you think about...are those three worries what really drives you to do your work?
Or are they more theoretical for you and how do you weight the different AI safety issues? Wojciech: Actually, let me at first try to even describe where usually the drive comes from. As a kid, I did quite a lot of mathematics. And you realize that in mathematics, I've got a lot of pleasure by solving difficult problems. That all of a sudden, like this amazing moment of excitement once I was able to figure out a solution to some mathematical problem. I actually realized that that's the main drive for majority of scientists. That there is a just very complicated puzzle involving mathematics and computers, and somehow they can put all the pieces together, and that actually gives them amazing excitement. So that's cool. But simultaneously, it would be very sad if due to these excitement, we would actually destroy a lot of value or destroy how the humans operate and so on. So there is a piece of me that is excited about the technology, about solving mathematical and computer science problems. And there is also a part of me, like I'm thinking maybe from perspective of altruism and responsibility. It's like at some point of my life I realized that ultimately the happiness comes from within and I actually have already everything that I need. Then it's almost like my cup is full, just want to make sure that there is enough for others. So then it becomes quite natural to think, ""How can I actually make sure that my work has the maximally positive impact?"" And in case of AI, it is actually quite complicated. Lukas: Why did you choose to work on robotics? Wojciech: Actually, here is a reveal. I was actually working for several years on robotics and as of recently we changed the focus at OpenAI, and I'm actually, I disbanded the robotics team. Lukas: Oh, wow. Wojciech: Yeah. Lukas: Why did you do that? Wojciech: Okay. So, the reasoning is...there is a few pieces. It turns out that we can make or check on the progress whenever we have access to data. And I kept all our machinery unsupervised, reinforcement learning, they work extremely well. There is actually plenty of domains that are very, very rich with data. And ultimately that was holding us back in case of robotics. This decision was quite hard for me. I got the realization sometime ago that actually that's the best from perspective of the company. The sad thing is, I think if we would be a robotics company, or if the mission of the company would be different then I think we would just continue. I actually quite strongly believe in the approach that robotics took and the direction. But from perspective of what we want to achieve, which is to build AGI, I think there was actually some components missing. So when we created robotics, we thought that we can go very far with self-generated data and reinforcement learning. At the moment, I believe that actually pre-training allows to give model 100X cheaper IQ points, and then that might be followed with other techniques. Lukas: And what is pre-training? Wojciech: Pre-training, that's like, I can explain it in case of GPT-3. So pre-training in case of GPT-3, or in case of language models, means training them on some unsupervised task, such as next word prediction. And that builds in all the internal representation that allows the model to off the bat solve many tasks. And in case of robotics we haven't had such a data. Lukas: I see. So do you regret working on robotics? Wojciech: No. I think that actually we've got plenty of insights for other projects. 
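A toy sketch, not OpenAI's code, of the next word prediction pre-training Wojciech describes a moment earlier: a causal language model is scored on predicting each token from the ones before it, which is the loss minimized during pre-training. The small GPT-2 checkpoint and example sentence here are just illustrative stand-ins.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

batch = tokenizer('The robot picked up the cube and', return_tensors='pt')
# Passing the inputs as labels makes the model compute the shifted
# next-token prediction loss used during pre-training.
outputs = model(**batch, labels=batch['input_ids'])
print(float(outputs.loss))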
I think that also we built a really amazing technology. I would say I'm actually very proud. There were of course moments of sadness when I was making this decision, but I'm quite happy with where we got. Also, even from my own perspective, in the meanwhile I also manage other teams that made some significant progress in the new world, and there will be more information about it sometime. Lukas: Cool. I guess one thing that I always observe is when you look at what computers do versus what seems easy, robotics seems the most striking. I feel like the simplest things of picking up an arbitrary object, it seems like the most natural thing for my brain. It seems so hard, maybe harder than anything else that feels natural, to make a robot do it. What do you think about that? Do you think that there's more progress in the short term or will it be the last thing that we solve on the path to AGI? Wojciech: So there are two possibilities for me, like a few possibilities. So one is, if someone would be able to actually collect a lot of data in a natural way, I think that might bring the capabilities. Another possibility is that we just need very powerful video models, the same way as at the moment we have very powerful text models. We need very powerful video models to get it off the ground. The trickiness at the moment with video models is that they just require way more compute than text models. So in the case of text, already an individual word conveys a lot of information and it just takes a few bits to represent it. In the case of video, if we would like to process images of a size of a few hundred by a few hundred pixels, several frames at a time, that requires orders of magnitude more compute. I believe that if we would have models that have a really powerful understanding of video, it would be way easier to train them toward manipulation. There is also one more technical issue here. It's like, these models most likely would have to be very huge, and then there are difficulties in running them in real time. So at the moment I see a few issues with robotics simultaneously, and this idea to be able to go after domains where the number of issues is, let's say, one or two is very favorable. It's also when we started...okay, in some sense, we started all sorts of projects at the beginning of OpenAI and we didn't have the clarity of how and exactly what we want to build. And over time, we got way more clarity and accordingly we can increase the focus in different directions. Lukas: So that's the other question that I've always had, how does OpenAI think about the projects you pick? I feel like, maybe critics would say that OpenAI has sort of been too good at picking projects that are very evocative. Like you guys put out GPT-3 and the music stuff that you did, like at least to me it just seems so cool. But I think maybe some people feel frustrated that it's like, it feels almost targeted towards like a media event or something. Is that something that you think about at OpenAI or I guess, how does OpenAI pick what to work on next? Wojciech: We have some internal beliefs about what has to be built for general purpose intelligence. And people mostly choose projects on their own. There is also, let's say, some level of freedom to go after crazy high-payoff ideas. I don't think ever that people are like saying, ""Let's go after this one because it's high PR payoff."" It's more that we have amazing people conveying our work to the public.
And maybe if we would release GPT-3 or Jukebox as a TXT file, then people wouldn't say such things. Lukas: If you just did a bad job with the PR, then people would give you more benefit of the doubt. But I don't know, I feel like you chose to win Dota which...weren't other people thinking about this and it seemed like it was a very clear milestone I guess, as opposed to putting out a paper on reinforcement learning at massive scale or something like that. Wojciech: Yeah. So there's also actually an element of internal motivation with these significant goals. I actually think that Elon suggested we go after Dota. The motivation was, ""Let's pick a very complicated game,"" such that if we would make progress, it would be undeniable. So there are a lot of toy tasks out there. Like for instance, people work on a humanoid walking in MuJoCo and this one is clearly, I'd say, disconnected from reality. Because people can make it walk in a simulation for multiple years already, but none of it works in reality. And then here in the case of Dota, we wanted to ensure that what we are after is actually meaningful. So, how to ensure that it's meaningful? Some people are really devoting their lives to actually playing Dota, who are strategizing about how to play against us. Lukas: How much of the work then on Dota was, you felt, like fundamentally moving ML forward and how much of it was Dota-specific or can you even pull those apart? Wojciech: I think there was a decent amount of Dota-specific work. And I think it was more than optimal, but also simultaneously hard. So I remember at the beginning of the Dota project, it was actually unclear how to approach it. People were saying that contemporary reinforcement learning would have no chance of solving this problem. And people looked into off-policy methods, on-policy methods, evolutionary strategies. The thing that became quite surprising is that methods that already exist, with appropriate scale, work extremely well. So that was a big surprise. And I remember some people, even before the Dota time at OpenAI, saying that maybe reinforcement learning is a dead end. And all of a sudden it's a very different story now. Lukas: For sure. At OpenAI, do you feel like you're competing with someone? Wojciech: The way I would like the competition to be perceived is actually as a competition with a bad outcome. Lukas: With what? Wojciech: Bad outcome. Lukas: Bad outcome? Wojciech: Mm-hmm (affirmative). Lukas: Oh, I see. Competing with a bad outcome. Wojciech: I wouldn't like us to necessarily compete against, let's say, other technical labs and so on, but obviously there is some fear of being scooped or so. It's interesting that in the case of large projects, I have seen it way less than in the case of individuals working on a paper. So my understanding is that it's very easy to be scooped when you're working alone. And it's almost impossible to get scooped if you work with, let's say, seven people. Lukas: Why is that? Wojciech: So I think it might have to do with the fact that there are many people working individually, but very few working as a group. Lukas: It does seem like OpenAI is maybe uniquely good at that. It seems like compared to academia, you have many more authors on your...or compared to ML research typically you seem to do bigger projects and have more authors on your papers. Wojciech: I think that in reality we need both.
Sometimes we need these insights from secluded individuals who are working [from] their hermit house for several months to figure out that there is actually a different way to build a Transformer or to train models or so. And it's almost impossible to work on such stuff as a larger group. But then eventually we want to build systems. The systems allow us, all [at] the same [time], to take our work to the next level, next level, next level. Lukas: I guess. What role do you feel like OpenAI plays that maybe the corporate, like DeepMind, isn't doing or Berkeley isn't doing? Wojciech: I actually think that OpenAI has a fair amount of push on safety, that it became a mainstream topic. It wasn't a mainstream topic. So I think that's extremely important. Yeah. I actually think that's one of the most important things. Lukas: Do you feel like it's sufficiently a mainstream topic now? I mean, it's certainly much more mainstream than in 2015. Wojciech: In some sense, I would like it to be sufficiently mainstream such that we would avoid bad outcomes. But I also almost think that the small bad outcomes might be a good thing. Because then they will inform the public that actually these problems are real rather than imaginary. At the moment in the case of GPT-3, we see some rudimentary aspects of safety. It's more like on the side of controllability. We have a model that can have a conversation with you, but it's unclear how to make sure that the model won't be offending you or won't go off the track or won't leak some secret information. And you almost think about it as the safety of an airplane. There will be multiple layers into it. Like I could imagine that one layer might have to do with appropriate data filtering, or maybe then another layer has to do with injecting human feedback and maybe some final, let's say, discrimination item at the very end. So I would say, I think that at OpenAI there is a lot of discussion about this topic. And at the moment, some aspects of safety became important even from a commercial perspective. Lukas: Mm-hmm (affirmative). And so it seems like you've made GPT-3 something of a commercial product. Is that right? Is that how you think about it? Wojciech: Yes. I mean, our thinking is that if we want actually to deploy AGI one day, then it actually might be very important to have something lower-stakes around before. And GPT, it's definitely lower stakes. We can see what the ways are that the systems might be failing. Lukas: Do you think there are any intuitions from neuroscience in general that can guide the development of machine learning models? Wojciech: There is obviously a question, is consciousness independent of intelligence? Or how are they related, and what would make AI conscious? I guess there are a few proposals now. It might be the case that all that is needed to be conscious is to build a model of the reality around you. And at the moment, our models implicitly build such a model. That would be a claim in the direction that actually our models are conscious. That's maybe one axis. The other axis is another idea behind what consciousness could be. It's like, you can look in mathematics and computer science for some very special mathematical objects. You can notice that in mathematics, a lot of weird things pop up once you allow a mathematical system to be powerful enough to point at itself. In computer science, there is a similar phenomenon with the halting problem. Once the system points at itself, there is undecidability.
I can say that maybe intelligence fundamentally has to do with compression, and compression and prediction are the same thing, so for instance, next frame prediction is actually compression. And once the system would become powerful enough that it tries to compress itself, that might be in some way analogous to the halting problem or to Gödel's theorem in mathematics. Also some people claim that consciousness is not a property of information, but rather it's a physical property, most likely of the electromagnetic field. Then that would actually mean that our AI wouldn't be conscious. It could have the same behavior as we do, but it wouldn't be conscious. So I frankly don't know which of these is true and that's something that I actually keep on thinking about, I'll say a fair amount, because in some sense, consciousness is almost our subjective experience. It's almost the only thing that I can be certain about. When I wake up, that's something that I experience. I cannot be that certain about a mathematical equation or that tomorrow there will be a new day, but I'm certain that I'm having a conscious experience at the moment. So it is an incredible mystery and I think it should be solvable by science. And AI allows to...or in case of artificial intelligence systems, we can control every aspect of the computation. Lukas: I guess, one difference with consciousness and the halting problem maybe, there's not a binary consciousness on versus off, but it seems to me like there's different levels of this. I think we sort of intuit that in the sense that we want to be kind towards other humans and we want to be somewhat kind to a cat, but we don't put it on the same level. Do you feel like the models you're building might be sort of approaching the consciousness of a worm or something? I mean, certainly they can do things that animals can't do. Wojciech: So yeah, frankly I don't know. There is a Slack channel at OpenAI about welfare for artificial intelligence. Because it is conceivable that through some kinds of trainings, we could generate an immense amount of suffering, like massive genocides, but frankly, we don't understand it. We don't know if, let's say, giving negative reward to a model is the same as stabbing someone. Lukas: Right. It seems at first glance it seems maybe ridiculous, but then it's hard to pull it apart. It's hard to really articulate what the difference is. Wojciech: Yeah. I mean, the interesting thing is... So I can see now a path from here to AGI. Of course it might take a really long time. And I think that there is a belief maybe that if a model would be having human intelligence, then most likely it would be as conscious as a human. At the same time, at the moment I can speak with GPT. I can ask GPT about consciousness, and it would tell me, ""Yeah, of course."" It would explain its conscious state and so on. Of course it has to do with GPT being trained with data speaking about consciousness. But the weird thing is, how would I be able to distinguish if indeed GPT would become conscious, versus just knowing about it? I think there's a few funny answers that come to my mind. So one is we could try to remove all the data that mentions consciousness, train the model on it, and then have a conversation about consciousness and the model will say, ""Oh, that's something I was thinking about. And I noticed this thing and that's so surprising that it's there."" That would be maybe one way. 
Another way that comes to my mind has to do with even how to check that some other human is conscious. So one idea of verifying that some other human is conscious, is literally by connecting brains. If you can connect brains and feel that their consciousness expanded, then that might be an indication that someone else is conscious. There are of course various counterexamples, but you could imagine that similarly, if you would connect your brain to AI, and if you would experience that your consciousness expanded, that might be evidence. Lukas: Well, that might be a nice note to end on, but I do want to pull this back into a little bit more practical realm for two final questions that we always ask people. The second-to-last question we always ask is, what's a topic in machine learning right now that you think is underrated or doesn't have enough people paying attention to it? Maybe something that you would study if you were totally free to start anew on some other topic. Wojciech: I actually think that the models that can decide on their own compute budget, that they can keep on spinning inside, like a Turing-complete model, like a universal Turing machine or a universal Transformer. Or you can think about something like having an inner monologue as a means of just increasing the amount of compute, that the model somehow, while solving a problem, speaks inside of its head. I think that's what I would work on. Lukas: Cool. All right. And the last question that we always ask, and this is for our audience which I think is a little more practically minded day-to-day than the conversation we got into, but what's the thing that you think is the hardest part today of going from a conceived model to a deployed model? And maybe specifically for you, I'm curious about robotics. If you were building a robotics company or OpenAI was like geared towards just making a successful robotics application, which would be amazing, what do you think are the challenges that you need to solve today to make that work? Wojciech: So I think that there are actually two stages. So the first stage is creating a model that is good enough for any deployment. And then the second one is literally building meaningful, viable products such that there is feedback and actually resources can be focused in the appropriate place. Lukas: And what might that look like? So you need something useful enough that you could make a lot of it and deploy it so it's collecting data that...am I understanding you right? Wojciech: Yeah. So, I mean, you could imagine for instance for a robotics company, it seems to me that the problem of pick and place is actually completely tractable. I would also say that I wouldn't shy away from collecting data. So I think that the path that I would take now, if I would be focused on solving the problem, I would at first try to find some viable domain where there's a big enough market and the movement doesn't look too complicated. And then I would hire plenty of people to operate in parallel. And I would collect a million trajectories and then train a model on it. And I would say people are very excited about reinforcement learning. And I think reinforcement learning is very, very powerful. But at the same time, I'd say they almost shy away from supervised learning. In my belief, if I would have a company I would double down on supervised learning, and it just keeps on surprising me how far it takes you. Lukas: All right. 
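The recipe Wojciech describes, collect a large set of demonstration trajectories and then just do supervised learning on them, is essentially behavior cloning: regress the demonstrated action directly from the observation. A minimal sketch in PyTorch, with random stand-in data in place of real robot trajectories (the dimensions, network, and hyperparameters are arbitrary illustrative assumptions, not anything OpenAI used):

```python
import torch
import torch.nn as nn

# Stand-in for logged demonstrations: (observation, action) pairs.
obs_dim, act_dim, n = 32, 7, 10_000
observations = torch.randn(n, obs_dim)
actions = torch.randn(n, act_dim)

# Simple MLP policy: observation in, predicted action out.
policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavior cloning: plain supervised regression onto the demonstrated actions.
for epoch in range(5):
    perm = torch.randperm(n)
    for i in range(0, n, 256):
        idx = perm[i:i + 256]
        loss = nn.functional.mse_loss(policy(observations[idx]), actions[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```

Reinforcement learning can be layered on top where a reward signal exists, but the supervised step alone is the part he suggests doubling down on.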
Well, thank you so much that was a lot of fun, I really appreciate you getting up so early. Wojciech: Thank you Lukas. Well, have a great day. Lukas: Thanks for listening to another episode of Gradient Dissent. Doing these interviews are a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",6381 +Phil Brown — How IPUs are Advancing Machine Intelligence,https://www.youtube.com/watch?v=2J8Lo3TD8lo,3430,2021-05-27,"Phil: We can no longer rely on things just getting better, [where] every two or three years we'll get another 50% or 2X energy efficiency, or whatever the scaling is. That's really slowing down. So the specialization of the processors is being driven by that. So we need an architecture that is more memory-efficient. If you go back to the fundamental processor, we don't move data very far. So the whole architecture is geared around data staying local for the processing and the physics of moving data is one of the things that really drives power consumption. So there's doing the actual operations, so driving the computational units, and then there's moving data to and from your memory subsystems. So if your memory's very close, the cost of moving data there is a lot lower, energy costs, compared with like if it's off chip, the cost gets a lot higher. This goes into the power or consumption of the device, where are you spending your power? Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host Lukas Biewald. Phil Brown leads Graphcore's applications team, building high performance machine learning applications for their intelligence processing units, or IPUs. Phil's background is in computational chemistry, which is maybe one of the topics that I really wish I knew more about. What he works on now is hardware for machine learning, which is the other topic that I really wish I knew more about. So I always say this, but I could not be more excited to talk to him today. I really want to talk about Graphcore and what it does broadly, and what you're doing there. But I thought it might be fun to start off with, I was looking at your background and I saw that you were originally trained as a computational chemist and then was working at Cray. And we've actually noticed at Weights and Biases a whole bunch of computational chemists using our software, which has been intriguing. I wanted to hear your career path and how you ended up at Graphcore. Phil: Yeah, certainly. So it's been a bit of an interesting journey and I would be interested to know what they were doing, whether they... I mean I guess running sets of molecular dynamics or quantum chemistry calculations. Lukas: It seems like there's a lot of drug discovery and some material science, yeah. Phil: Okay, yeah. That's pretty much what I used to do a long time ago. So running computational simulations of various different spaces. The way I ended up in the machine learning space was via the high performance computing arena. 
And actually my PhD was writing computational chemistry codes, so quantum chemistry, density functional theory embedded inside a molecular dynamic simulation, and actually looking to try and accelerate the density functional theory, the quantum chemistry bit of that, using very early accelerators. So actually, I did a PhD at the University of Bristol and there was a company in Bristol called Clearspeed that were building an early numerical accelerator. Think before we had GPUs before that period, right when the first GPUs were coming out, that same kind of cell processor came out, if there are any people who were playing around. So PS2, that kind of era. And this company was trying to build double-precision, so HPC accelerators, so I was actually writing code t to use those for these kind of computational chemical simulations. So about 2005, 6, 7 kind of timeframe. And actually as it happens, my boss today is somebody who worked at Clearspeed and was building those systems and a number of the, particularly the software team, have heritage kind of going back to that. There's a bit of a Bristol group of hardware and software engineers who have done various kinds of things over the years. That was really what got me in from being a chemist, I am not a computer scientist in any sense. I kind of dabble a little bit but I'm not a software developer or a computer scientist. And that took me from a pure chemist into the HPC and computational science domain of high performance computing and building this sort of thing. I spent a couple of years in a consultancy that was specializing in helping people buy these systems actually, and then went to Cray and then helping design and build these systems. And actually I did a variety of different things, I spent a couple of years focusing actually on weather forecasting. So the numerical science and how you build large production systems for weather forecasting. And particularly in the US, NOAA and the National Weather Service, here in the UK the Met Office, and actually around the world at that time Cray were building systems for 80% of the large weather centers, so the national weather centers and those kind of things. So that was great fun. But as the machine learning domain started taking off, I was quite interested in that as a field. It was clear actually that super computing, the high end, wasn't going to continue growing at an exponential rate and there happened to be this little company in Bristol called Graphcore that had a really interesting technology, was just starting to make some waves. So I got in touch with Matt (Matt Fyles, SVP Software at Graphcore) actually, and a few other people, and ended up coming to join Graphcore and actually working as part of leading the field engineering group, so customer facing. And technical teams to actually, working directly with our customers to build applications. Lukas: Can you talk a little bit about, I've always been fascinated by how weather prediction works. Phil: So weather prediction is an interesting field. Fundamentally it's quite simple, the atmosphere is a set of fluids interacting. So you can describe that with a set of equations and you can just kind of solve using those equations. So it in some sense is, it's just a giant fluid dynamics simulation. But it's also a bit more complicated than that because you've got particles, you've got lots of very interesting surface effects. You've got the Coriolis effect where the earth is actually rotating. 
You've also got quite an interesting initialization problem in that space because you don't... I mean climate simulations are much longer duration, weather forecasting simulations typically, you might only care about the next 10 hours, 12 hours, two weeks. So your initialization is actually critical. So actually the data assimilation where they take the global set of satellite observations and other kind of weather observations and integrate those into the model as the starting position is a really, critically important part of that. So there's lots of quite hairy maths and lots of big computers to try and scale these systems. The other thing that's quite close to machine learning or certainly common is this idea of time to train, time to get to a solution, is quite important. We're running a big simulation, then actually if you're going to have to wait three weeks for it, it's pointless actually running it. Your experimental cycle has to be manageable. And in weather forecasting, a weather forecast has to be, you'll be able to predict two weeks in two hours or something like that. So actually being able to meet that operational deadline for delivery was quite important for that. Lukas: How does the physics, the physics simulations, compare to a more machine learning approach where you make less assumptions about the underlying physics and just try to treat it as a standard prediction problem? Phil: In NWP and in most of the computational sciences in general, you're building a simulation based on some set of physics or chemistry or material science, or whatever particular discipline you're in, biology, there will be some set of fundamental principles that you are modeling in your system, so it's very much a science based, first principles based approach to solving these problems. I mean they typically do have approximations in them, so there's quite a bit of interest I think, particularly in the climate field, but also in the weather field, of replacing some of their parametrizations of systems where the physics is too expensive to run. So the particle interactions are too expensive to model directly at large scale. So up till now they have used approximations for that, actually trying to replace their basic approximations with machine learning models that will be cheaper, or more accurate, or both. So there is that kind of interaction where with everything, in the entire world, you could technically simulate everything right down to the lowest quantum interaction state, level, but that would be phenomenally expensive. You wouldn't necessarily want to do that. Lukas: Also you can't observe it? I think the observations would be messy. Phil: Well I mean if you're going right down to an individual electron, yes, you wouldn't be able to observe that state. But the quantum interactions, the difference between the biology and the chemistry, or the molecular dynamical sphere and the quantum mechanics sphere is where you've got these binding energies where you're actually making and breaking bonds. Those are the quantum mechanical effects starting to come in, like you're making those bonds. So you can accurately simulate those things, it's just you can't observe the individual particles at that level. So the simulation of the kind of binding energy is still possible at that level. But I mean that's phenomenally expensive. At the time I was doing it was difficult to model water and maybe small groups of water molecules where you've got the hydrogen bonds, that was getting a little bit expensive. 
I suspect a decade on we're probably a bit further than that now, but still you won't be able to model... Well, they might just be able to do a full protein or something like that. But it's also a question of, is it meaningful to actually... you don't need that level of fidelity or you don't need that level of modeling, where do you want to spend your compute time. Lukas: Or even your observation time, I'm imagining modeling the weather on planet earth, you can't get very fine grained at all, right? From observing the state of earth. Phil: Well that used to be the challenge, it very much used to be the challenge. It's a lot better now that they've got satellites that give them complete world coverage. The challenge before that was that you didn't have observations. Actually the Met Office have a great, or an interesting, set of observations and analysis around D-Day, where in 1944, the invasion of Europe, the prediction of that weather window where they actually launched the invasion. The Germans did not think that there was going to be a weather window, based on their analysis of the weather, because they had much, much sparser observations in the North Atlantic. So at that point in time there was a real lack of observational information. I think that's been closed, I mean satellites today give you full globe coverage for a lot of things. They maybe don't give you the vertical profile in the atmosphere that you might want in some places, but they also have observations from aircraft and a range of other things. Lukas: So this is kind of a naïve question I guess, when I look online or go onto Dark Sky or something and get the seven day forecast, are those meaningfully improving over my lifetime? Phil: Yes, I mean it depends if you're looking at these things emotively. If you look at the analysis, yes, they're measurably getting better. Lukas: What does it mean to look at it emotively? Just feeling that it's wrong? Phil: I don't know, we're British, we're always complaining about the weather and we're always complaining about the predictability of the weather here. It's raining here at the moment. But the improvements in these kinds of forecasts are incremental, but I mean over a decade the accuracy of a forecast out a day has improved quite significantly. Lukas: This is probably location dependent, but at what point does just forecasting out based on the physics of what's going on stop being meaningfully better than forecasting based on climate, or what's the average state of weather? Like can you predict out three weeks and have a meaningful gain with a physics-based model? Phil: So the numerical systems, and this is getting to the edge of my knowledge now, I think the numerical systems are good out to two weeks. So the long range forecasts are typically out to the two to three week window. Then they're now starting to do seasonal, bridging the gap between climate, which is multi-year and decadal, and the short term NWP (numerical weather prediction). They're starting to do seasonal prediction, and they are showing skill, i.e. prediction above random, prediction above the climatology. The climatology just looks at history and bases the prediction on the average of history, so they're starting to show skill beyond that, and things like El Niño prediction and this kind of thing, they are starting to show skill out at that kind of range. But it's very much the mean, for example are we going to have a wet summer or a dry summer. 
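Skill in this sense is conventionally measured against a reference forecast such as climatology, for instance as one minus the ratio of the forecast's mean squared error to the reference's: zero means no better than predicting the historical average, one would be a perfect forecast. A small sketch with invented numbers (operational centres use a much wider battery of verification scores):

```python
import numpy as np

# Invented example: observed temperatures, a model forecast,
# and the climatological average for the same dates.
observed    = np.array([14.2, 15.1, 13.8, 16.0, 15.5])
forecast    = np.array([14.0, 15.4, 14.1, 15.6, 15.2])
climatology = np.array([14.8, 14.8, 14.9, 15.0, 15.0])

def mse(pred, obs):
    return float(np.mean((pred - obs) ** 2))

# Skill relative to climatology: > 0 beats the historical average, 1 is perfect.
skill = 1.0 - mse(forecast, observed) / mse(climatology, observed)
print(f"MSE-based skill score vs climatology: {skill:.2f}")
```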
The challenge, I think, for those organizations when they're articulating that is... So the Met Office had a wonderful thing where they said it was going to be a barbecue summer. The headlines were barbecue summer, that was picked up by the press. What actually turned out was that it was a little bit warmer and a little bit wetter. But people's perception of what barbecue summer means is that it's going to be nice and dry the entire time. That's not necessarily what the prediction was saying, and slightly warmer than average, slightly wetter than average doesn't really translate to people's experience. So that's the kind of thing, interpreting the information can be quite challenging. But making it generally understandable is the challenge. Lukas: We should talk about chips but I have one more question. Phil: We should, yes, we should stop- Lukas: One last question. What is the function that you're trying to optimize when you predict weather? Phil: Oh, well so they're not trying to optimize. Lukas: Well how do you measure success, I guess? Phil: So yeah, that's a better... So they have a very wide range of metrics. So they're looking at sea surface temperature. You're comparing the state of the atmosphere that you predict against the state of the atmosphere that actually exists. So you have a set of observations, so the temperatures, the atmospheric pressures, the amount of precipitation. There's a huge range of skill scores that these organizations generate. If you're interested in this, ECMWF, which is the European Centre for Medium-Range Weather Forecasts, has quite a detailed set of... if you go and dig into their webpage, quite a detailed set of analysis on their forecasts. And as they're producing new forecasts, they're producing analyses of where is it improving and where is it degrading relative to what they had before. And ideally you want all of the numbers to be green. So they're doing quite a lot of work there. And you can actually see the evolution, and going back to computing, you can actually see the evolution in computers there as well, because they step up the resolutions as they're getting better systems. As they work as a software team to develop their software, they're delivering higher resolution forecasts which tends to translate to better accuracy in the models. Lukas: Thanks for digressing, that was fun. And if anyone's listening or watching this and knows more about this, let us know. I'd love to know more. Phil: I will have to admit, my knowledge is very much... It's probably five years old and even at that point I was not an expert in this space, so I will apologize if I have got anything massively wrong, please correct me. Lukas: It's okay, so I guess you felt like high performance computing wasn't growing as fast as what Graphcore's doing. I guess what is the difference between high performance computing and Graphcore? Why isn't it the same kind of problem with the same kind of hardware solution? And I should say, I don't really know what high performance computing is so I think I need some definitions to even understand the point. Phil: High performance computing in the sense of numerical simulation where you're using a set of physics or chemistry to create a model of a system and generate some kind of prediction of behavior, or generate some kind of output. 
Typically those systems are relatively input light, so you'll be inputting a small amount of information, a model or structure of a protein that you want to have... a ligand that you've got an interaction with or a description of a furnace or something like that, and a flame. And you want to understand how that system behaves. So you actually generate huge amounts of information out of those kinds of systems. And there's a huge space and it's been going for many decades, and has in the past 20 or 30 years been growing moderately fast. The machine learning space is really geared around taking very large amounts of data, very large amounts of data and using that to build... Rather than apply a set of rules to that data, you're using the data itself to build the model system and to learn the rules itself. So it's a learning system rather than a system you're designing to solve a problem. I think that's probably the easiest, at a high level, way of describing it. Lukas: What does that translate to you? I can kind of see how those are different, but I'm imagining, ah there's probably a bunch of linear algebra underneath both of those problems, why do you need different types of hardware to solve them well? Phil: So the difference is from a computational science perspective are generally the HPC simulations require quite high precision. And there is a bit of debate in that community about whether you really need 64 bit everywhere, whether you should really be doing 32 bit in some places. But generally you need quite high precision for most of that field. And 90% of it, I'm guessing today, is probably done in double precision. With machine learning you're trying to learn from a very large volume of data and make actually a relatively limited set of predictions out of it. But it's the learning process and what's become very clear is you don't need very high precision when you're in this kind of learning process. So NVIDIA started out, or the people when they were leveraging NVIDIA GPUs started out using single precision. GPUs were good at single precision and it was much faster than double precision. Then people discovered, well you don't actually need single precision, you can do it in half precision. So somebody built some hardware that was better at half precision. So they started leveraging 16 bit and people when they're doing inference they're using 8 bit INTs, and 4-bit INTs and they're looking... People are even playing around with binary formats. So it's very clear that this domain has, from a computation perspective, at that level a very, very different characteristic, from a numerical precision perspective, different requirements. And then the other thing that's quite clear, and actually quite interesting about this space is that, so today we treat everything that we work, almost everything that we work with as dense linear algebra. So if you look at a classic CNN model like a ResNet, that convolutional network is typically translated when you're actually doing the maths with it on the computer, into some kind of dense structure that you're working with. Even though a convolution could be looked at a relatively sparsely connected patterns. And if you look at transformers and these kinds of systems that we're using and seem to be eating the world in natural language processing, they are big matmuls, big dense matrix objects. What we also know is that if we train a model at the end, we can then prune it quite aggressively and not lose very much fidelity. 
Particularly if you go through a few training cycles afterwards as well. And there have been a number of papers, Rigging the Lottery and a number of other ones, that are theorizing that actually what we're looking for, the systems we're interested in are actually fundamentally sparse. So we want to be able to train sparse systems. We think if we could train these systems in a sparse way, we'd save a huge amount of FLOPS. If we only had 10% or 1% of the parameters in the system, we wouldn't be calculating all of these other numbers. So there's a real interest in these systems in actually being able to do sparse algebra efficiently. And not just for inference, but for training as well. We also are in a place where Open AI and some of the very large organizations in this space, or organizations with access to very significant compute power are building huge models. It would be really nice to not have to quite go as far as that. So if I didn't have to build a five trillion parameter model, and I only had to build a 500 million parameter model, that would save me a lot of compute. It would reduce the cost of using that model, it would reduce the cost of training that model. I might still have to train it over a very big data set, but it would make it a lot cheaper to do iterations upon that. So that's the other thing that I think fundamentally differentiates the machine learning space and the problems that we're trying to solve. And that's not to say there aren't sparse problems in HPC, there definitely are. But that combination of sparse and low precision, and particularly the sparse bit is not something we factored. Lukas: Well the sparse bit is not something that's really supported, right? In general practice. Is there ways to take advantage of that sparseness now with existing hardware to train faster? Phil: So today, or as of... Not really, and so this is one of these chicken and egg problems where somebody needs to go and build some hardware that allows you to solve these kind of problems, then they also... But nobody builds the hardware until the problem's there that really justifies it. So we are starting to see these kind of things evolve. So one of the things that I'm really excited about with our next software release is that we're including both static sparse libraries, the ability to work with static sparsity. So you know the sparsity pattern up front, this might be the attention matrix in a system or it might be a mask or something like that. You can typically know some of these things up front. As well as dynamic sparsity, where you don't know the sparsity pattern. So you can have a changing... and we can deliver this with actually very significant performance on our architecture. Because that's one of the things about the IPU, it was significantly designed to be a very fine grained processing system and to be able to target these problems as well as being fast and good at the dense stuff as well. This is the thing, you can build sparse computing systems but they typically go so much slower than the dense computing systems that actually just running sparse on the density, filling it full of zeros, makes much more sense. Lukas: That's funny, I was going to mention that. I mean I am decades out of date on this, but I remember doing a little bit of work on this in grad school. I mean I would predict I guess, based on my incredibly old experience, that a sparsity factor of 1%, you might as well just fill in all the zeros like you were saying and not even worry about the sparseness. 
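That fill-it-with-zeros intuition is easy to check on a CPU with SciPy: an unstructured sparse matrix multiply only starts to beat the dense BLAS call once the matrix is overwhelmingly zeros. A rough sketch, with arbitrary sizes; the exact crossover point depends heavily on the machine and libraries:

```python
import time
import numpy as np
import scipy.sparse as sp

n = 2000
x = np.random.randn(n, n)

for density in (0.5, 0.1, 0.01, 0.001):
    mask = np.random.rand(n, n) < density
    a_dense = np.random.randn(n, n) * mask       # dense array with many zeros
    a_sparse = sp.csr_matrix(a_dense)            # same values, CSR storage

    t0 = time.perf_counter()
    _ = a_dense @ x                              # dense BLAS matmul
    t1 = time.perf_counter()
    _ = a_sparse @ x                             # unstructured sparse matmul
    t2 = time.perf_counter()
    print(f"density {density:>6}: dense {t1 - t0:.3f}s  sparse {t2 - t1:.3f}s")
```

With unstructured sparsity the indexing overhead dominates until the density is very low, which is the effect both speakers are pointing at.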
Phil: Yeah, and this is going back to the HPC space, people have never used the sparse solvers within the HPC space because they're so slow, or the sparse linear algebra, unless they've got a 99.9% sparse problem, in which case they start making sense. So some of the interesting things about the characteristics we have in machine learning is they aren't that sparse, actually. They're dense enough that doing the pure sparse arithmetic doesn't necessarily make sense, but we also believe that some of our structures are big enough that you can get away with having small, dense blocks within them. So the thing that's really difficult with 100% sparse systems is... Well there are a couple of things that are difficult, the access patterns moving around a lot is something that's quite difficult to handle. But from a really low level computational perspective, the way that we get efficiency on all of these computer architectures is by having dense block structures that we work with. And particularly two dimensional functional units. So if you want to keep those busy, you need a block of work that's about the same size as those units. So for us those are quite small, they might be 16 by 16. So actually in big structures the accuracy degradation that you get over a pure sparse system going to one of these small block sparse systems isn't too much. I mean this is, and I say that, there's been a very limited amount of work on this because the hardware just hasn't existed. But the indications are that it looks like there's a really nice compromise, where you can get really great performance with a relatively... Whilst leveraging this big, sparse system. So I would say we're right on the cusp of people starting to be able to use these systems and fundamentally explore and develop the algorithms, both the sparse training as well as understand where the break points are. I mean it may be that we discover, actually, no, no, 16 by 16's too big. What we really want is a four by four, or we want an eight by eight. Or we actually need, we're going to need a 16 by 16 works great if we're doing GPT3 and you've got really big matrices, but it doesn't work so well if we're doing BERT and you've got slightly smaller matrices. So there's a trade off in terms of relationship versus the hidden side of something like that. So I think we don't know, I think that's what's so exciting at the moment, is that there is some really new ground. I would say the one thing that attracted me to this space was a), growing clearly a really interesting field. But also virtually, I wouldn't say completely green field, but there's so much we don't know. I mean the evolution over the last five years has been astonishingly fast and it's been really exciting to be part of it. Lukas: Can I just... I'm just trying to picture this, I'm not an expert at all on this space, but does sparsity help with something simple like for example, a convolution? I'm trying to picture what even a sparse convolution would mean, does it mean a lot of the parameters are zero? And my input data is certainly probably not going to be sparse, right? Phil: It possibly doesn't make sense to think of it in a convolution. Although you could clearly maybe have a larger... So typically in a convolution you have a small mask that you're moving across your image, you could potentially think about having a slightly bigger mask that had some holes in it. That would be an interesting sparse pattern. 
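The block idea Phil sketches can be written down directly: store only the non-zero 16x16 tiles of a matrix and multiply tile by tile, so each piece of work is still a small dense matmul that can keep a two-dimensional functional unit busy. A toy NumPy version (the block size, shapes, and random keep pattern are arbitrary illustrations; a real implementation lives in the vendor's kernels and compiler):

```python
import numpy as np

B = 16  # block edge, roughly the functional-unit size discussed above
rng = np.random.default_rng(0)

def to_blocks(a, keep_prob=0.25):
    """Keep a random subset of BxB tiles of `a`, dropping the rest entirely."""
    blocks = {}
    for i in range(0, a.shape[0], B):
        for j in range(0, a.shape[1], B):
            if rng.random() < keep_prob:
                blocks[(i, j)] = a[i:i + B, j:j + B]
    return blocks

def block_sparse_matmul(blocks, x, out_rows):
    """y = A @ x where A is represented only by its retained BxB tiles."""
    y = np.zeros((out_rows, x.shape[1]))
    for (i, j), tile in blocks.items():
        y[i:i + B] += tile @ x[j:j + B]  # each update is a small dense matmul
    return y

a = rng.standard_normal((256, 256))
x = rng.standard_normal((256, 64))
blocks = to_blocks(a)
y = block_sparse_matmul(blocks, x, out_rows=a.shape[0])
print(f"kept {len(blocks)} of {(a.shape[0] // B) * (a.shape[1] // B)} tiles")
```

How much accuracy a model loses when its non-zeros are constrained to whole tiles is exactly the open question about 16 by 16 versus smaller blocks raised above.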
And we've gone to small masks, I think, because, well, partly they give you a nice characteristic in that they allow you to apply the same transformation everywhere. And we seem to have standardized around three by three in a lot of places. Whereas some of the early CNNs people were playing around with bigger masks and seeing where the sweet spot was. So I don't know whether the standardization around three by three was about performance, as in the accuracy of the model you were making, or whether it was a computational compromise in that it was a lot cheaper and didn't cost you that much in terms of accuracy, or whether actually there's a better sweet spot with a better sparse model. I don't know. Lukas: It does feel like there's some intuition that, like for example if you're imagining images, pixels closer to each other would be more relevant to each other. Phil: Yeah and I think that's certainly... and you would be picking up the edges and those kinds of things that you're thinking about actually going through an image processing process, I think there is some logic there. In the image context, I'm not actually very sure where, how we might be able to use this other than, it's a new toy, I'm sure somebody's going to go and play and find somewhere where it's interesting. I think the area that we're seeing probably the most interest is in the places where you're currently using fully connected layers, and you don't want to have to keep paying the cost of having a fully connected layer. So stacking multiple, partially connected layers together looks like quite an interesting approach, and an area that we know... I mean you see this with CNNs as well, you can prune CNNs really quite heavily after you've trained them and still maintain most of the performance. So can we train those fully pruned CNNs, can we train these fully pruned language models from scratch in a faster, more efficient way? So can we rig the lottery and find that lottery ticket within that large, dense model through the training process, rather than doing that from scratch? And if we could do that and it's efficient, then we might be able to access an even bigger model, because one of the things that limits my ability to train a model is, do I want to spend a month waiting for it to train? Probably not, because I'm going to have to do this 50 times, knowing the ML cycles we go through. If I could do that in a tenth of the time, or even half the time, a quarter of the time, it maybe gives me access to something that's four times as big. And it might be better, and that's the other interesting thing is that if you want to keep going up the curve of model size and try and drive the accuracy higher, it's having something that gives us more flexibility, another lever that we can pull. Another tool in the toolbox as we're exploring this space. Lukas: I can see how at inference time, a sparse, fully connected layer you could do a sparse operation, that seems quite clear. But the training seems tricky, right? If you don't know a priori where the zeros are and the non-zeros, how do you figure that out? I'm asking a deep question that's hard to answer, do you think you could explain that to me at all? Phil: That I think is one of the unknown spaces, because people have not explored this. So DeepMind, I think, published a paper called RigL, which is Rigging the Lottery, which was a way that they proposed to try and discover the right sparsity pattern, where you wanted your parameters to be. So I think it's... 
I mean we train these systems through an iterative search effectively, where we're learning the parameters, it's another parameter you learn, it's the sparsity pattern. So you'll be adding parameters in, you'll be taking parameters away elsewhere. You might have information in the backwards pass about where... So one thing you have to be careful of is you probably don't want to calculate the full set of gradients for that, the dense equivalent space because well... It depends what you're targeting. But if you're targeting something that is very big, that could get very, very expensive. So how you get the signal for where you should be adding and removing parameters, maybe it's something goes to zero and you randomly add it to somewhere else. Maybe you're trying to come up with some other method for adding these in, but that's one of the things we're going to find out is, can we do this efficiently? I mean it might not work, you never know. But I get to go and find out. Lukas: Is this the main thrust of Graphcore's point of view on the hardware? That sparsity's important or are there other... Phil: I think this is one of the things we're very excited about, actually one of the really interesting things right from the start of Graphcore is the founders, Simon Knowles and Nigel Toon, is they didn't set out to, oh we're just going to go and solve deep learning at the time as it was described. They set out to say we want to try and build a computer architecture that's designed for machine learning as a general problem. So what are the computational characteristics of this problem, so what do we need to do to solve that? And to an extent, take a punt at where they thought it was going to go. They got a few things right, I think they probably got a few things wrong as well. But what we've built is designed for general purpose architecture for machine learning. So it is very, very good at dense linear algebra and we're showing in the benchmark results, that I believe will be published by the time we go to the world, that we're showing world leading performance with BERT, one of the very common NLP systems, with some of the CNNs. But we're also showing that some of the classes of models that are more efficient, fundamentally, so Efficient Net, even in the name, but don't run particularly well on a TPU architecture or GPU architecture because they break up the structures that you work with. They're finer grained in the group dimension than the other standard CNN architectures are. Those work really well on our architecture, we have a significantly better advantage, a greater advantage with those kinds of architectures than we do with the standard CNNs. And that's really, we're pretty good at both of them, but everyone else is really bad at the more efficient architecture. That's the same kind of thing with these sparse models is that fundamentally our architecture has been designed to be massively parallel, very fine grain. So you can map these kind of sparse problems onto it very efficiently, and other architectures kind of weren't. They were designed to be very big, block, bulk structured. And they're trying to bolt some capabilities onto it, but it's just fundamentally architecturally a bit more limited in their capabilities. Lukas: So can you explain to me why it works better on, for example, BERT. I don't think of BERT as a... I mean BERT's an embedding, right? Those are sort of, they're not really sparse, they're dense aren't they? What's going on that it's faster? 
Phil: Well BERT as a model, it has an embedding and then it's got quite a deep stack of transformer layers. So there it's just, we are very efficient at doing dense, linear algebra. So we can beat the dense systems at doing dense linear algebra. Lukas: Wait but why? Can you explain that to me? What are you doing? Phil: So fundamentally, well, a) it was designed from scratch to target this kind of workload and we store parameters and activations locally actually within the physical chip itself. So one of the unique things about the IPU is it's a massively parallel architecture. It has about 1000 IPU cores per IPU, but each of those cores also embeds a very significant amount of memory. So we have about 900 megabytes of memory on each IPU, then we sort gang multiple IPUs together into a larger system. Lukas: So these are like registers I guess? So you have giant registers? Phil: It's not really registers, it's just a very fast local working scratch. So you might think about it like an L1 cache, but it's not a cache because it doesn't really cache anything, it's the memory that we work with. Lukas: So I feel like what you're describing though in my ignorant brain, that's sort of how I would describe what a GPU is doing. So what's the difference here? Is it a more extreme version of that or... Phil: Well so a GPU, its primary memory system is HBM (High Bandwidth Memory), so it's external to the chip. It's packaged in a kind of pretty package, but literally they are stacks of memory that are glued onto a silicon wafer next to the chip. So it's not in the main silicon entity – it's right next to it, you have to go five millimeters through another silicon wafer and go back up into a stack of memory. And that five millimeters means that they can only get, ""only"", about a terabyte a second of memory bandwidth out of their memory systems. Something like that, maybe it's one and a half in some of the A100's. Whereas we get about 50 terabytes a second in and out of our memory systems on one IPU. And actually from a power perspective we probably get about two IPUs per one of theirs, maybe a bit more. So the amount of memory bandwidth we can actually deliver is an order of magnitude, two orders of magnitude bigger in these systems. And that makes it really good at dense linear algebra because we can move data backwards and forwards. Actually dense linear algebra is a bit more limited by the core computational unit than the memory system. But a lot of our advantage comes with systems that are not quite as dense as the pure dense, linear algebra systems or the bits that go around it. So sparse systems, some of these other kind of flavors, that's where we really, really step out. So we're better on BERT, but we're a lot better in some of these other ones. I said this to an American who didn't really understand me, when I said it was kind of like jam today and jam tomorrow. So we're really good today, and then you also get some really great things to come tomorrow. Well, when we start to actually be able to exploit these new kinds of applications. Lukas: And is there any trade offs to your approach or some differently... well I'm assuming with TPUs, Google was imagining a fairly similar workload, right? I mean this was machine learning inspired. So was there some fundamental decisions that you made differently here? Is there any trade offs where your chip might be harder to use or worse in some scenarios? 
Phil: Yes, so the interesting thing about the TPU is actually from a genesis and idea, the architectures kind of came up about the same kind of time. And the TPU went very, very big from a functional unit perspective so they said we're going to do really big functioning units so that makes life really easy for the compiler developer. It makes life from a software perspective, it makes it a lot easier to target. But it means they really struggle with anything that's not a big, big matmul because their only big functional unit is a very big matmul. Whereas we've got a lot more flexibility with being able to handle smaller and more fine grained workload. So they're inspired, we want to target machine learning. But the observation that they took was okay, well that means we need to be really good at big matmul. And the observation we took was, okay well that means we need to be good at dense linear algebra but we also want to have all this other flexibility. So I would say if there's a downside of our architecture, it's that it makes the work of our compiler and library team quite a lot harder. So they had to work to build the library and the software ecosystem to allow us to attach directly into the frameworks and to provide the lowering from a large scale application workload. So we write in PyTorch, we write in Tensor Flow, to take that and translate that into something that maps onto our massively parallel architecture. So there's not a massive downside from a user perspective, it's a bit more of a downside for our team. I think it's taken our team a little bit longer to get that stack up and running. But what we do see, quite interestingly with this stack is we get very predictable performance across different architectures, across different frameworks. And actually between inference and training, so whereas... So some architectures you might have to go through a dedicated inference back end to get great performance. For us, we just take Tensor Flow, we take PyTorch and we just compile it, run from the framework and we get absolute, tip-top performance straight out of it because we put all of the work into the front end framework, trying to make it as fast as possible. Lukas: I guess it's funny, there's this thing that always makes me feel like I wasted my computer science education or something because I use typically NVIDIA chips. So I upgraded the cuDNN library, which I think is kind of similar to what you're talking about. I mean I think sometimes it'll give me a 30% speed increase. I just feel like this deep mystery of ""what happened?"" The hardware's the same, conceptually it seems like a fairly simple problem. How could you get such a massive increase with a smarter compiler? I guess that's some of the stuff you work on, can you talk about why this kind of conceptually simple thing is so complicated to get right, and why we can continuously improve our compilers to make these things run faster? If compiler's even the right word here, the translation from a network to a hardware... Phil: Yeah I mean I think compiler is the right word and our stack is probably about three compilers stacked, or maybe more than that. So I think the challenges are that these are... Oh if I was a computer scientist, I think that these kind of compiler transformations are an NP hard problem I think, but I might be wrong. But I think that's why that actually solving these kind of systems is quite difficult. 
So the compilers are typically developed to be quite general, or ideally you want a compiler that you can feed anything and it will give you something that works. But it won't give you something that's 100% optimal in every domain, because that's a very, very tough problem to solve. So as you find new applications and architectures, then you might put a bit of work into trying to optimize the performance of those. So sometimes what you're seeing is that the software engineers will have found or come up with a different way of laying out the data sets or a different... Sometimes these might be fundamental architectural innovations in that they change the behavior of a system. So that I think is what you're observing here, is that the GPUs have a very different execution model, so sometimes when they're fusing and doing some kind of transformations, that helps them in some particular areas. And I don't really know too much about the development details of those kind of platforms, but for us I think one of the things that we've observed is that I think we've still got quite a lot of headroom. So one of the other things that I'm excited about is that we are quite young in our development process of the libraries and the software and I think we've got quite a lot of performance headroom. So there are some numbers that I've done on the back of the envelope, and I know how fast the chip can go. The chip can go at 250 teraflops and it can get very close to that, sustaining linear algebra. And I know that some things I put through it don't go that fast. And they probably should go faster than they're going at the moment. So that gives me quite a lot of hope actually that even the things that we're talking about at the moment have quite a lot of potential. And that's really, the compiler we have is doing a pretty good job, but it's not doing a perfect job. And if we go and make it better, it'll give us a better set of performance. That's work, and actually some of the people that are doing this work are, I mean, exceptionally capable engineers, so it's just a case of giving them enough time and space to do some of this optimization. Lukas: So are your chips commercially available? Could I buy one and try it out? Phil: Yes, absolutely. And actually we have just or are just about to launch the second generation of our processors (Note: This episode was recorded in late 2020, before the launch of Graphcore's second-generation IPUs). So we actually launched the first generation a year ago, I believe. They've been adopted and deployed into Microsoft Azure, so we're really excited about the second generation of our product, which, we announced this I think a month or two ago, is coming very soon. The interesting thing is we've actually slightly changed the form factor that we're deploying these in. We used to build things that looked a bit like a GPU, a PCIe card. We've actually moved to a slightly more integrated form factor that has four of our IPUs in, it looks a bit more like kind of a 1U pizza box sort of server. And it's designed explicitly for scale. So we've moved from thinking about systems that are server-based with a host-processor and a set of accelerator cards, to a system that's designed to be able to just rack multiple of these IPU machines together and cable them with an interconnect, and you have the host remote across the network. So you disaggregate that host from the IPU processing, but also scale IPUs, we can go from one to 64 out to 1000s of IPUs in a very tight integration. 
So yeah, we're really excited about this and actually the performance, the scalability, all the kind of aspects of this technology are really interesting. And we're talking about some classes of models, BERT, ResNet, we've talked about some of these CNNs. Actually these are all, I mean BERT's fairly big today. A couple of hundred million parameters, but it's nowhere near the really, really big models that people are working with. So I think that's some of the things that we're really interested in is being able to drive the scale of these training systems, but also do it more efficiently. So we give people the tools to train large systems or train systems to high levels of accuracy without needing to go all the way into that completely dense, linear algebra. Lukas: Do you worry about some of the things that have been in the zeitgeist lately about models getting bigger and bigger? Like only the biggest companies having access to be able to train them or carbon footprint. Is that a real effect? I imagine it might actually help you, but maybe bad for society? Phil: So the societal impact of access to this technology are a fascinating topic. I'm probably not one for this because I suspect we could spend another hour on that alone. We're really focused around trying to make this technology available to as many people as possible and also as efficient as possible. So I think the way that we'll lower the bar for access to this kind of thing is by enabling people to run models that are more efficient, and enabling them to work with architectures that don't require a billion dollars of compute to train the model. I mean the big challenge around that is always going to be access to the data, because I mean the one thing, we find a compute person, I think about the compute, we also to an extent have to think about the data and access to that. And really that's the bit that seems to be favoring some of the very large organizations today is that they have the ability to pull together the training sets that most people don't have access to. So there are two sides to the access to this technology story that I think are- Lukas: What about energy issues? Do you think over time these kinds of chips will become a significant user of energy? Phil: I'm not convinced, compared to the rest of the fleet of web-service infrastructure in the world that ML's ever going to get to the scale where it's more expensive than they are. Lukas: Didn't Google say that some huge fraction of their compute centers was doing inference? Phil: If they have, I've missed it. Lukas: I could be wrong. Phil: So that would be an interesting observation. I mean it's not going to be zero, so the question I think is how much of a percentage of that it is. And also how much of it is going to be training versus inference? I guess if they're driving their search backend via inference and if they're driving all of the back end Google Photos and YouTube and all of those kinds of things- Lukas: And certainly they are, right? Phil: Well yes, so you follow down that, maybe it is. So yeah you could be right, the inference workload could look quite large. But again I think that's probably an area where you would be looking to deploy dedicated chips. This is why people build dedicated chips because they're more efficient than the general purpose chips. So the whole idea of trying to do this is to make something that is more cost effective, so it costs less in terms of dollars per model trained or dollars per inference served to your customer. 
And part of that's the power cost, part of that's the procurement cost of these kind of things. So I think that comes into the factor, that's why we build these special purpose architectures, or at least specialized architectures. The other comment is with the end or slowing down of Moore's Law, there is a very significant plateau in the rate of improvement, or the shrink and also the energy efficiency. We can no longer rely on things just getting better, [where] every two or three years we'll get another 50% or 2X energy efficiency, or whatever the scaling is. That's really slowing down. So the specialization of the processors is being driven by that. So we need an architecture that is more memory-efficient. If you go back to the fundamental processor, we don't move data very far. So the whole architecture is geared around data staying local for the processing and the physics of moving data is one of the things that really drives power consumption. So there's doing the actual operations, so driving the computational units, and then there's moving data to and from your memory subsystems. So if your memory's very close, the cost of moving data there is a lot lower, energy costs, compared with like if it's off chip, the cost gets a lot higher. This goes into the power or consumption of the device, where are you spending your power? And so that's... One other premise is actually of the IPU is fundamentally more efficient, higher floating point operation watt of energy input because we don't move data as far, we try and keep everything as local as possible for as long as possible. Lukas: I guess one more question on chips, just the timing. Apple recently came out with a new M1 that a lot of folks are talking about that included some ML-focused stuff. Do you have any opinion on that? Phil: Well it's a really interesting bit of tech, and they showed some really interesting overall performance improvements. I think this is an example of specialization going out into all of these kind of systems. I think it's also an example of the spread of machine learning and the workload out into all of these kind of systems. So I'm not sure in the context of Graphcore and building data center scale training and inference systems, it's probably not something that is particularly relevant in terms of marketplace, but it is interesting to see... I mean we've seen this with mobile phones with dedicated inference chips being embedded into, I mean I think all of the ones that I've got kicking around have one of these things in somewhere that they're using for photos and other kinds of things. So I think that's just you'd almost expect it because every kind of modern, consumer-facing workload has some kind of ML embedded into them or I would guess that most of them do. Lukas: Well thanks so much, I mean this has been super fun. I feel like even if it wasn't being recorded, I've learned a lot, I love it. So we always end with two questions, I'd love to ask you these. So the first is pretty open ended, well they're both open ended, but the first one is also open ended. The question is what is one underrated aspect of machine learning that you think people should pay more attention to than they do? 
Phil: So machine learning is a bit of a chicken and an egg in that because it's built around processing very large volumes of data that require quite a lot of compute, the bar to actually get to a state-of-the-art solution is quite high, just in terms of the amount of work that you have to do from a computational perspective. So you have to have, any kind of data processing algorithm has to be quite efficient and be able to run at teraflops, tens of teraflops to be able to chew through that. So either something that's much more data efficient in the way it learns, or something that we can find new computational architecture to give us the efficiency on new classes of models, I think those are things that might be really interesting. Lukas: I have to say it's funny, we've had a bunch of computational chemists talk to us on this show and also in customer interviews, they're all talking about graph based networks. It seems like that might be an area where there's a lot of interest. Phil: So one of the ones that we've been working on, and I'm not sure when we're going to be able to publish it, but actually is a graph based neural network using the spectral library in TensorFlow, and it's a very small example. It's not anything fancy or ground breaking, but it's just an example, I think, of doing molecular binding prediction using that kind of approach. Lukas: Cool, the final question we always ask is what's the biggest challenge of making machine learning models work in the real world, but I'm kind of tempted to modify it for you. I'm wondering, what's the biggest challenge of taking a new piece of hardware to market? It seems like there must be challenges everywhere. But where are the surprising challenges? Phil: So I would like to answer the first one as well because one of the things that we've done quite a lot of, so we've talked a lot about performance. How fast does it go, and actually performance is a beautifully simple thing because it's very easy to measure. What's the images a second? What's the sequences a second, how fast does it go? But the other bit of that is actually you don't just care about how fast it goes, you care about it giving you the right answer as well. So you care about your systems converging. One of the things that we've been really interested in exploring, actually part of the reason that we're working with Weights & Biases is as part of building these kinds of very large convergent systems, leveraging and doing all of those kind of experiments. So finding the right kind of batch size that gives you the optimal performance whilst not impacting your kind of convergence scheme. That's one thing that we've been working with. We had quite a lot of fun I think with the numerical behavior in some of these systems which particularly, so we talk about low precision, good, goes much faster. Also, dangerous because you need to manage the precision a little bit more accurately than you might do in some other kind of systems. So building a system that gives you great performance and also gives you the right answer, I think that's one of the things we've found interesting as we bring these systems up and particularly, I would say, in the first generations of our systems we had some really interesting convergence schemes running at very, very low batch sizes showing actually extremely rapid convergence, even on some big models. And they were really good, but the one thing that we observed today, looking at our large scale systems, is that they wouldn't scale.
They wouldn't have enough batch size to be able to scale to very large systems and we're actually reworking some of the systems we work with to support much larger batch sizes. So looking at optimizers, we had been using SGD or SGD-M, SGD with momentum, quite a lot. We're looking at LAMB, very large scale batch optimizers that have been used by Google and Nvidia as well for their large scale systems. So yeah that's certainly been something that's been a whole bunch of fun, and I would say has been very challenging. I mean the number of hours of compute time that we have been spending developing these kind of systems, to a certain extent, finding the bugs in the models sometimes where oh, we've got the layers wrong or there's something that's just not quite laid out correctly and that's impacting the convergence of these systems, so we need to go and find that. So there are those kind of things. In terms of actually building, bringing the new hardware to market, that has been a tremendous journey. It goes all the way from completely new architecture, massive amounts of memory on chip. How do you at the fundamental silicon level test that system and make sure that your processor actually works? So that was an interesting problem that some of our team had to tackle, and we very successfully worked through how do you take one of those systems and integrate it together into a cluster of 16 IPUs, a cluster of 64 IPUs, a cluster of 1000 IPUs. How do you make that kind of system work at that kind of scale? How do you take all of the various applications and map them down to the frameworks, how do you support multiple different frameworks efficiently? There's been lots of fun across all of these spaces. So one of the things that I would observe is building these very large scale training systems is one of the big challenges, it's one of those really big... It's a bit like building the old super computers, the grand challenge problems of our time essentially. So it's quite interesting to go and try and do that from scratch with a completely new set of architectures, and actually I mean one of the fantastic things about Graphcore is how quickly we can move through some of these processes. There have been a lot of challenges through that phase, I would say we've met most of them with great success which is quite nice. We're at the point where we can now bring this all to the world, which is very exciting. Lukas: That's so exciting. It seems like such a fun job and congratulations on the latest benchmark, we'll definitely put a link to that in the show notes. Phil: Yes, thanks for having me, I mean it's been a lot of work from quite a large team of people. And actually very little from me, so the hardware and the software team at Graphcore have been beavering away for a long period of time and they've all done a really fantastic job. Lukas: Awesome, thanks for your time. Phil: Excellent, thanks very much Lukas. Lukas: Thanks for listening to another episode of Gradient Dissent, doing these interviews are a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to the episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.
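For readers curious about the LAMB optimizer Phil mentions moving toward for very large batch sizes, here is a minimal NumPy sketch of the core idea: an Adam-style update scaled by a layer-wise trust ratio. The hyperparameters and single-tensor framing are illustrative assumptions, not Graphcore's implementation.

```python
import numpy as np

def lamb_step(w, g, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-6, wd=0.01, t=1):
    """One LAMB-style update for a single parameter tensor w with gradient g."""
    # Adam-style first and second moment estimates with bias correction
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adam update direction plus decoupled weight decay
    update = m_hat / (np.sqrt(v_hat) + eps) + wd * w
    # Layer-wise trust ratio: scale the step by ||w|| / ||update||,
    # which is what keeps training stable at very large batch sizes
    w_norm, u_norm = np.linalg.norm(w), np.linalg.norm(update)
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = w - lr * trust * update
    return w, m, v
```

In practice one would apply this per layer inside a framework's optimizer API rather than by hand; the sketch only shows why the trust ratio differs from plain SGD with momentum.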
",10354 +Alyssa Simpson Rochwerger — Responsible ML in the Real World,https://www.youtube.com/watch?v=T4Fuk9Ow9J4,2729,2021-05-20,"Alyssa: One of the challenges in the healthcare space is often you don't get the answer to did this treatment solve the problem. You either get nothing happened after that, right, or maybe I went to a different doctor or somewhere and you just don't have the data, or maybe I didn't take my meds because I didn't pick them up or whatever else. But there's a lot of challenges in the healthcare space with actually getting good data sets in order to do machine learning. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Alyssa Simpson Rochwerger is an old friend and colleague and expert on real world AI. She's currently the Director of Product at Blue Shield California, and, before that, she was VP of AI at Appen and Figure Eight, the company that I founded and ran for a decade and sold to Appen. Before that, she was Director of Product at IBM Watson, where she was an important partner for Figure Eight, so she has over a decade of experience in machine learning, and she's the author of the book, Real World AI: A Practice Guide for Responsible Machine Learning, which covers basically everything we talk about here in this podcast every week. So I'm super excited to catch up with her and talk with her today. Tell me about your work on the vaccine. I'm dying to hear about it. Alyssa: Sure. So at the end of January, you may have heard, Blue Shield got asked to help with the vaccine rollout in California, and I was privileged enough to get a phone call, I think, the following Saturday from one of our senior executives saying, ""Hey, Alyssa, can you come help? What are you doing right now? Can you join our meeting with the state either today or tomorrow?"" I said, ""Sure, Jeff, you bet."" It was supposed to be a two, three-week thing, and I think this is week 12 where I've completely dropped my day job on the floor and just helping out the state, and so there's a team of us that has been deployed full-time, and it's been an absolute whirlwind and privilege and really exciting. Lukas: So what are you doing practically? Alyssa: Yeah. Have you heard of myturn.ca.gov? Which, if you haven't, go get your vaccine, schedule it. So there's a website where everyone in California can get vaccinated through and schedule appointments. We've been coordinating enhancements to that, working with the 61 different local health jurisdictions in California. Each one has a slightly different set of challenges and opportunities. So, for example, the Bay Area, we have really low hesitancy rates and a lot of really eager people who are willing to drive three hours to get their vaccine, whereas down in Southern California at the moment, there's where we have more supply and we're starting to experience more hesitancy. There's appointments availability pretty easily, hard to reach communities that are not interested or not able to access the vaccine, so we put a really heavy focus on equity and making sure the people who need the vaccine most get it first and are able to access it. This week, it's all about home-bound populations, people who can't leave their houses. How do you get vaccine to them? When these things come in 1000-plus dose things, you got to thaw them out. Pfizer's a deep cold freezer situation. It can only be at room temperature for so many hours. 
If you are an ambulance worker going to a home-bound person's house, you need special training to understand exactly how to administer this and how many houses can you go to before the vaccine expires, right? Anyways, so, logistically, super complicated, and so I'm helping on the operations and tech team, so everything from doing data analysis to understand where should we ship vaccine and who do we get it to to helping onboard providers. I think we've contracted with over 3000-plus or more providers in the state of California, so Kaiser's a massive one, Sutter Health, Dignity, but there's a long tail of much smaller clinics. Providers, I think, there's over 1500 clinics on MyTurn that are giving out vaccine across the state, so different challenges in Tulare County versus Alpine County versus the Moscone Center in San Francisco and the logistics of making sure everyone in California gets vaccinated. Lukas: Wow, and is it- Alyssa: So there's a lot to do. Lukas: At this point, is it mostly the logistical problem just getting the vaccine to the person that wants the vaccine, or are there other- Alyssa: There's a lot of challenges. The three big things that are limits to getting shots in arms are supply, so the first several months of this have been supply constraint. We only get so much supply from the federal government. The other potential constraints are ability to administer vaccine, so that was what we focused on really heavily for the first month and a half or so, the third-party administrator for California, is making sure that we could build up a network of providers who had the logistical capability to receive supply of vaccine and administer it, right? So you need nurses, you need security guards, you need freezers, you need ability to mass vax or whatever it is. Some of these are mobile clinics going into agricultural communities, some of these are how do you get the word out to people, so all that capacity. Then the last problem is willingness, right, as you need people, arms, to put shots into, and so some of that is a hesitancy problem. Some of that is ability to schedule an appointment, right? So 40% of California speaks Spanish, and then there's a long tail of other languages, Vietnamese, Chinese, Hmong, and how do you address and reach all those communities, not just logistically support them with making an appointment if they want to but also helping them understand that the vaccine is good and safe and they should show up and get an appointment. Curve balls get thrown like J&J no longer being administrated, so that was last week. I think we found out at 6:00 AM or something, and by three hours later, we were able to switch the supply to ... I think there were 8500 appointments in the 48 hours, and we had to switch to either Moderna or Pfizer for the vast majority and then reschedule a handful of those appointments. Lukas: I guess, as a data person, did you have feelings about the J&J decision? Are you even allowed to talk about things like that? Alyssa: Oh, I have no insider information. I read the news just like you do. I assume that the really incredible scientists and doctors who have been making this vaccine and diligently testing it and following the quality control protocols ... It's a good thing that they're pausing and reviewing it and looking thoroughly. I have plenty of loved ones who've received the J&J vaccine, and, so far, they've all been good and haven't had any problems, knock on wood, but I'm really glad that everyone's taking it super seriously. 
Lukas: Well said. I guess the main thing that I was planning to talk to you about was the book that you wrote, which is- Alyssa: I wrote a book, yeah. Lukas: Yeah, you wrote a book. Congratulations. Alyssa: Thanks. Lukas: Well done. It feels like real world AI is really, as long as I've known your career, what you've been working on, so it does seem like you would be the person to write this book. Actually, I'll say one thing as an aside. I'm always impressed by people that are able to write a book without feedback. Was it a challenging process? How did that go? Alyssa: I was voluntold to write a book, which I think I've said before. It was a fascinating process. I had a lot of help, great team. I'm dyslexic, could certainly not have written a book by myself, so- Lukas: Are you actually dyslexic? Alyssa: Yeah. Lukas: Oh, interesting. Alyssa: Yeah, extra- Lukas: How does that- Alyssa: Extra time in school, the whole thing. Lukas: Really? Alyssa: Yeah. Lukas: Wow. Man, working with you, I'd never noticed anything like that. Do you have a hard time reading? Alyssa: Dyslexia is an umbrella term. It means a lot of different things for a lot of different people. So me and my sister, both dyslexic, totally different manifestations. I cannot spell to save my life as an example. There are always typos and issues in every email I send, and I don't even see it. My sister is an outstanding speller. Math, not her strength. So our issues are different. To be diagnosed with dyslexic, you have to score above average or pretty high in certain categories and then average or below average in other categories, and the delta between those in enough categories is what classifies you as dyslexic. Lukas: Interesting. Alyssa: So, person to person, you could score high or low in totally different areas. Lukas: So I would imagine that would make it even harder to write a book, something that already seems very hard to me. Alyssa: Yes. No, so writing the book, it was a really interesting process, took a long time. We started out with an idea of what we wanted to do and organized that into an outline and then started fleshing out those outlines and interviewed you, thank you so much for your interview and contributing to it, and then lots of other folks who were willing to share their stories about what it's like to actually build and deploy machine learning based technology in the real world for real, actual use cases and not BS hype. What's great about the machine learning community is people are really nice and they want to share their stories and they want to help others is what I've found really consistently. Not every story were we able or authorized to use publicly and put in a book. There's a lot of lessons learned, and we had to anonymize quite a few. But a bunch we could, so it was awesome. The process is you do an outline, and then you talk through each chapter one by one, and then you go back and reorganize information or content that makes sense in perhaps multiple places. Our editing team knows how to write books, and they do this all day long for a living, so turned word vomit from Alyssa and Wilson into actual paragraphs and sentences. Lukas: I feel like you had a focus, as you do in your career, on ethics and responsible AI. Was it hard to get people to talk about that? It is in the zeitgeist, but I wonder if it's hard to get real world stories of tricky issues. Alyssa: Yeah. 
Easy to talk about at an abstract level, easy to talk off the record with people around lessons learned and challenges, harder to get them to go on record about failures and specifics of those failures in large public companies. Yeah. Lukas: Can you talk about some of the failures and what happened, some of the anecdotes that you have in your book? Alyssa: Yeah. I'll start with a personal one that I think we've talked about before, but when I was at IBM, I was new to machine learning and we were launching a visual recognition system. The API did a very general thing, but we were improving the accuracy of it, and I was new and I was like, ""Well, how do you know it's better? How is the accuracy better?"" The team settled on the F1 score as a fairly good measure of that, and our F1 score improved, and there was a big delta, and we were excited to launch the next version. A couple days before launch, one of the team members reached out to me and said, ""Alyssa, we can't launch this,"" and I was like, ""What are you talking about? It's better. We've all agreed. There's a lot of energy behind this."" He sent me an image that I tested against the algorithm, and the tag that came back for that image was the word loser, and the image itself was a picture of someone in a wheelchair. I was horrified, and I thought that that was terrible bias that we didn't want to encode and we certainly didn't want to launch. It really gave myself and the team a wake-up call to, hey, how could this have gotten into our data when the accuracy is supposed to be better? The aha moment for me as a newbie was, well, stupid, of course, it's the data. The data and the training data and the tags that you've associated with that are the problem. So we have a great team where we went back and reviewed every single tag, which was thousands and thousands and thousands of tags and millions of images, and we reviewed it by hand as a team. We divided and conquered, and we pulled out quite a few objectionable things that we didn't want to be the public face of IBM. That took time and money and pain, and we were able to relaunch something that contained less of what I would call unwanted bias in that particular system, but IBM's certainly not free from that, and many others have had challenges with visual recognition systems. Particularly, there's been a lot of talk recently about bias in facial recognition systems, so they're tricky to get right. Lukas: It seems hard to fix, but maybe even more concerning is that it was just caught by someone who happened to be trying something. Do you have recommendations on diagnosing these kinds of problems? Alyssa: Yeah. So that was a long time ago, and, since then, there's a ton more. But what I always say is, ""Be proactive around the biases that you're looking for."" There's a handful of biases that are regulated, right? You don't want to have gender bias, you don't want to have racial bias or other ones. So depending on what your system is deployed for, you're going to want to set up a proactive monitoring system to take a percentage of your real time public data, do it also in tests, but also, once you've launched, take the real time data and siphon it off and review it, typically with humans manually, or at least set up some alerting if things fall or skew outside of what your normal expectations would be.
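As one way to picture the proactive monitoring Alyssa describes, here is a minimal Python sketch that routes a small slice of production predictions to a human review queue and alerts when the observed tag distribution skews away from an expected baseline. The tag names, sample rate, and tolerance are made up for illustration, not values from IBM or any production system.

```python
import random
from collections import Counter

BASELINE = {"person": 0.40, "car": 0.30, "dog": 0.20, "other": 0.10}  # expected tag mix
SAMPLE_RATE = 0.01      # fraction of live predictions routed to human review
SKEW_TOLERANCE = 0.10   # absolute drift that triggers an alert

def maybe_sample(prediction, review_queue):
    """Send a small random slice of production predictions to a review queue."""
    if random.random() < SAMPLE_RATE:
        review_queue.append(prediction)

def check_skew(sampled_predictions):
    """Return alert messages if the observed tag mix drifts from the baseline."""
    counts = Counter(p["tag"] for p in sampled_predictions)
    total = sum(counts.values()) or 1
    alerts = []
    for tag, expected in BASELINE.items():
        observed = counts.get(tag, 0) / total
        if abs(observed - expected) > SKEW_TOLERANCE:
            alerts.append(f"{tag}: expected ~{expected:.0%}, observed {observed:.0%}")
    return alerts
```

A real deployment would also track per-tag quality (for example per-label precision or F1 rather than a single aggregate score), which is exactly how a harmful tag can hide behind an improved overall metric.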
That involves usually dashboards and a lot of data and looking through tags but also proactively setting up a feedback mechanism so that people can report things that you didn't hear about or you didn't think of and being able to escalate those quickly and react quickly and adjust and hopefully be able to retrain your model or remove it or have a back-up plan that does not include your model if you need to take it down for some reason for an extended period of time to mitigate things that you didn't anticipate. Lukas: But then, I guess, fixing it, it doesn't seem realistic these days to go through all of the training data set and take out everything objectionable, or maybe it is. On a model that's trained on millions or billions of records, do you have recommendations there for how to improve the quality of the training data in a maybe more cost-effective way? Alyssa: Even if it's millions or billions, it's where are you getting those millions or billions, and is there selection bias in where you're getting that data from? So take a speech recognition example. The speech recognition systems today available in the US are better at understanding men's voices than they are women's voices, or they are better at understanding people who speak English as their first language versus people who speak English as a secondary language, and that's largely due to the data that is collected, and it's thousands of hours of data collected, but if you're not actively collecting data from the populations that you want to serve, you're going to have a challenge there. So I think, even in aggregate, it's appropriate and quite feasible to think critically around where you're getting your data and does it reflect the community or the people that you are going to be serving with the model. Lukas: I guess it's a little bit of a different issue maybe, men and women versus English as a first language, English as a second language, at least if you think about ... Well, I don't even know, I guess. I guess there's more people with English as a second language, but you could imagine a case where there's a smaller group of speech patterns, for example. Do you think that you should collect at the ratios that you have, or should you try to over-collect the more rare cases? Do you have thoughts on that? Alyssa: I think that's where a team comes in, and I'd ask you ... I'd certainly skew towards over-collecting rare cases but definitely monitoring for those cases to understand how those are performing. I think as a team you need to understand and balance the business priorities because it's not always feasible to collect. So let's say you're trying to deploy audio recognition in a call center, and let's pretend you're, I don't know, Walmart, right? You serve most of the United States, but look at your customers. Do they skew to people who speak English as a first language or people who don't speak English as a first language? Are you going to deploy in your entire call center, or are you going to start with just California or just Texas? Start to look and deploy models in a small way is what I find often works best to find a narrow place to apply, and then scale up as you can prove success and also collect more data typically. Because let's say you deploy it in Texas, so Texas has a heavy Spanish-speaking population and you get a model that works well. Let's say it's only for handling returns. 
But if you then want to expand, say, to Georgia, well, a Southern drawl is going to come into place, and that model that you built for Texas is probably not going to work that well for the population in Atlanta, which skews more African-American versus Latino and that's a different sort of speech pattern. So you could deploy the same model, but you're probably not going to get the same results, and so needing to collect more data, mature it, and tweak it. I think less around, okay, how do you start right from the very beginning to try to do everything well, and it's more like, hey, start small with a specific and narrow business problem and do that well and then gradually grow and use perhaps different or related data as you grow in order to address those additional needs. Lukas: That makes sense, and that just seems like best practice for any case, even setting aside ethical concern. Alyssa: Yeah. I think one of the big not mistakes perhaps but challenges coming into machine learning is that there's a lot of hype and everyone thinks you can solve a really big problem with machine learning with magic and that's just never the case. It's much, much more successful to start narrow and start small and build out, and that's also a good way to address unintended bias, is by narrowing what you're trying to do because it narrows the data set that you need, it makes it less expensive, and it allows your pilot to be more successful. Lukas: I guess, as you were researching the book, what other sorts of anti-patterns or failures did you see besides those types? What other things did people run into? Alyssa: We talked a little bit about the Goldilocks problem, which is trying to pick the right problem to solve and pick the right size and narrow problem to solve that's well-suited for AI. I think another challenge is around team and getting a successful team in place that has the right mix of skills in order to successfully deploy something to production. This is not a case of a lone data scientist or even a team of data scientists building something. In order to actually get something into production, you need DevOps, you need data engineers, you need a UX designer, typically. You need a product manager, you need regular front-end software engineers and backend engineers. You need a whole team of people that is responsible for actually deploying something into a production environment, and that can often be, at many companies, harder than developing the model itself, is actually getting to production, because you start to run into things like legal and security and risk tolerance. All of those things mean you have to have a back-up plan and you have to understand how you are going to handle unknown and you're going to need escalation paths, and putting those sort of business and technical processes in place, often, it's the business thinking through that implications of the model or what happens when a decision is made and what happens if you need to explain why that decision was made to auditors or whoever is going to be scrutinizing this. That conversation, if you don't start early and don't involve those people early in the process, can be big blockers to launching something to production. So what I encourage folks to do when they're starting out is think broadly around the cross-functional team of people that you want to have on your bench. They need to be diverse, right? The finance people should certainly get involved. Sometimes, HR needs to get involved, and it's not just the engineers. 
Lukas: Are there any specific stories you can share where legal came in at the end and blocked something or HR or finance wasn't involved and then the project couldn't launch, even though the ML model was working well? Alyssa: So I'll use the Amazon one for HR, right? There's a very public story around how Amazon was trying to use machine learning to predict who was going to get hired at Amazon or who would be really strong candidates for jobs there. I don't know if it was HR that blocked it at the 11th hour, but they found it to be not serving the HR professionals and the goals of the HR professionals because it was super biased and it was biasing against women pretty heavily. So that's a scenario where the model was working very well, I think, at the beginning, which is predicting who would be strong candidates, but they weren't considering some other goals that were really important to Amazon, which is hiring a diverse employee base. So that's potentially a case where the training data wasn't appropriate, or I'm not sure exactly what went wrong behind the scenes there. Perhaps you know those people. I don't. But those are the types of things where legal or HR can say, ""Hey, you know what, can't do this."" I know Uber has also had challenges in terms of making sure that the escalation paths for support tickets, which they use machine learning to classify, appropriately route the right tickets in the right way to the right level of severity, and scrutinizing that process. Because if you miscategorize something that's really urgent, that's potentially a legal challenge for the company. Lukas: I guess channeling what our audience asks me about all the time, I'm curious if you have suggestions for an ML practitioner who wants to work on something meaningful or wants to work for a company that really embodies responsible or ethical AI. Do you have any suggestions on what they might look for in the interview process or before that or maybe even companies that you think do this really well? Alyssa: Sure. I recently got out of the AI business, and I got into healthcare, which a lot of well-meaning mentors and people I admire are scratching their heads being like, ""You left all these lucrative job opportunities on the floor to go into an insurance company. Alyssa, are you out of your mind?"" Maybe I am, but what I looked for, I followed the money when I was making that decision, and I don't mean personally. I mean follow how the money goes in the business or the organization. So I chose to work at Blue Shield, which is a nonprofit organization, and the incentives for the company are to cover more people in California with health insurance at a lower cost. By law, we cannot charge more for premiums. If we accidentally take in more money than we pay out in healthcare, we have to give it back to the people of California, which, this year, because of the pandemic, the model's all over the place and wrong, and we ended up giving a lot of money back to our subscribers. So, for me, understanding how a company makes money and what drives the business will ultimately drive some of the models that take place. If you look at Facebook or you look at Google or you look at Amazon, these are for-profit companies. Facebook makes its money on advertising, and so they have some of the most sophisticated advertising models in the world around encouraging the right content in front of the right person. For me, that wasn't something that I wanted to spend my time doing.
There's a lot of awesome people that work there, but it's not for me, and I decided to go in a different direction, and I think it can be really hard for people to take a really hard look at where they want to spend their time and their day-to-day and what problems they want to think about. I'm feeling really fortunate to be thinking about how to get more vaccines in arms. There's not much machine learning that's going into that, frankly. It's spreadsheets and pretty basic data analysis, but I'm thrilled to be spending my time doing it, and I hope that Blue Shield can work on some cool interesting machine learning problems in other areas. Lukas: I guess I wanted to ask you about that. It does seem to me like there are a lot of really interesting ML applications in this field. One of the Blue Cross, Blue Shields ... I think they separate by state, but I think one of them is actually a Weights & Biases customer, and I think Figure Eight had some customers in that realm. Do- Alyssa: Yeah. There's tons of interesting use cases in health insurance. Lukas: Can you tell me about some of the use cases in health insurance? Alyssa: Sure. A simple one that you know super well, Lukas, is around looking at healthcare data. If you're looking at an aggregate for thousands or millions of people, you're trying to understand what are patterns in terms of a patient's record over their lifetime that can be indicative of good outcomes, right? For example, I've been having carpal tunnel challenges from working at home and not moving nearly enough, and I went to the doctor, and they prescribed some steroids and some physical therapy and whatever else. But if you look a few months later, my hand was still bothering me. That didn't really work that well. So are there patterns that you can look at at a population level to recommend particular courses of treatment that work? From a machine learning perspective, if, and this is a big if in healthcare, if you have a good data training set that's cleaned and well-organized and you're able to access, you could look at large outcomes like that and say, ""Hey, did Alyssa need follow-up after that? Did we have to spend more money on healthcare? Was her problem solved or not?"" That's actually one of the challenges in the healthcare space, is often you don't get the answer to did this treatment solve the problem. You either get nothing happened after that, right, or maybe I went to a different doctor or somewhere and you just don't have the data, or maybe I didn't take my meds because I didn't pick them up or whatever else. But there's a lot of challenges in the healthcare space with actually getting good data sets in order to do machine learning. So that's one use case. There's other use cases that I would call simpler, like chat or people try to file claims or have billing issues and being able to respond faster to people and make our call center agents more efficient with their time by automatically answering tier one support issues like I lost my password or whatever else and being able to handle that in a lot of different languages. For example, some machine learning can support those types of use cases. Lukas: I guess, how real is this? Is ML chat used today? Like if I went to the Blue Shield website, would I interact with a chat bot? Alyssa: Yeah, we're rolling out chat. I don't know if you went today. I'd have to get back to you. It's not my particular area of ownership. But we certainly have chat, I think, also for the providers. I've learned a ton about insurance. 
It's an interesting space because you have customers that are members, like you buy health insurance from us or you get it through your employer, but we also have doctors who interact with the insurance company for lots of different reasons, so that's the providers, and then, also, the employers or brokers or HR people, and all those people need help. So I'm pretty sure our chat is rolled out for employers and brokers and providers. I'm not sure if it's for members. Then, also, we certainly have it internally as well, so if I need something as an employee, I can use our virtual assistant internally to order a new mouse or get provisioned access to a system or whatever else for IT support. That's actually been a really successful use case for us. Lukas: The health record stuff seems so evocative, right? I would love to be able to do a deep data analysis on my own health record and- Alyssa: Yeah, if you could get it. Ask your wife. Lukas: If I could get it, yeah, and look out into the future. Maybe these do exist. Would you say that your employer is currently doing analysis of health records to forecast what might happen to people? Alyssa: Yeah. Absolutely. We look at population health, and not just us, but we work with other companies who perhaps do some of this analysis. Then we actually consume the insights from those analysis, and we work with a lot of different partners. There's a platform we call Wellvolution, so I'll give you an example. We work with one company that has done a lot of analysis around kidney disease, so people who are on dialysis and getting good outcomes there. They figured out, ""Hey, here is the right way to treat kidney dialysis patients that has better outcomes,"" and so we encourage and steer our patients towards this particular program because it's proven that it has better outcomes than perhaps treating it without this program. So that's an example where we try to recognize patients who have a particular diagnosis or condition and then encourage them to use the programs that have the best outcomes. Lukas: I see. So it'd take one hypothesis and just test it based on... Alyssa: No, not that I'm aware of. Maybe there are other people that I don't know about. But it's more around, if you look at population, the big things are the same things, right? It's diabetes. It's hypertension. These are the big things that impact our population, and so if you can encourage people to shift their often lifestyle habits to things that are going to be more successful, you can have better outcomes, but as it turns out, it's not easy. It's a lot easier said than done to get people to take their health seriously, and some people don't, right? Some people are like, ""It's just not a priority for me to change my lifestyle to be healthier,"" and other people are super, super eager to do it, and then there's a bunch of us probably that fall somewhere in between on that spectrum, and we're willing to make certain accommodations or changes in our lives and others we're not. So how do you use different tools or different approaches for different populations to move them into healthier lifestyles? Because if you take a step back, at least, Blue Shield, it's not that we want to pay less money in healthcare costs. We want to get everyone healthier because healthcare as a force in the macro US economy is an incredibly inefficient expense that we simply can't afford. 
It represents a huge percentage of our spending as a country, and it's not sustainable, and so we need to find ways to get our healthcare costs down as an industry overall because it's just not something our economy can support. Lukas: What have you been working on, before the vaccine and post leaving Appen? What projects have you been- Alyssa: I was working on this longitudinal healthcare record problem. We launched a pilot, which I'm super excited about, and so for a certain percentage of our members, they actually can get their longitudinal patient record with every provider's data, if we have access to it, so that's a big if. But if you've submitted a claim to Blue Shield or your provider participates in one of the statewide networks, in California, it's called Manifest Medex, and they have thousands of providers that send data, and by provider I mean doctor. So if you're a large healthcare institution, you may participate in this, and then we show it to you as a member, and you can look at your record. Then we also recommend things that perhaps you haven't done, so if you haven't gotten your flu shot or if you haven't been to your annual check-up or you're overdue for a cancer screening or something like that, we'll say, ""Hey, Alyssa, you haven't gotten your pap smear this year. Go get it done."" You can interact with us, and you can say, ""Oh, actually, I already did it, and you just don't have the data,"" or, ""Thanks, let me set a reminder to go get that done."" So that's my baby, my project that I was working on before I got pulled into the vaccine work. Lukas: That's so great. If I'm a member of Blue Shield, could I use it? Alyssa: Yeah. It's rolled out to, I think, about 50,000 people right now, and we are working on it and hopefully going to ramp it up to more Blue Shield members in the future. Lukas: That's so cool. I would think, for myself, there's things that no ML algorithm would be needed to tell me would make me healthier. Alyssa: Yeah, a lot of it is pretty simple, right, like, ""Hey, you did or you didn't do this."" It doesn't require machine learning. But what has required machine learning type of thinking in this project is around, frankly, data-cleaning. So we may have multiple records of the same medication for the same member, right, like I get prescribed birth control every single month and I have multiple prescriptions assigned that overlap with each other, so if you look at the last 10 years, if I'm displaying that to me as Alyssa Simpson, I don't want a list of 10 years of data worth of every medication I've ever been prescribed. I want you to group it logically by the brand name or the medication type, and so that is a data-cleaning machine learning exercise around grouping medications together because one pharmacy may have reported it with slightly different wording or dosage or something versus another pharmacy or another doctor, and so organizing that information, machine learning and natural language processing could be super useful there. Lukas: I guess I should say, and we were joking about this, but my wife runs a company called PicnicHealth that does a lot of this stuff, so- Alyssa: Which does a bang-up job, by all accounts. Lukas: Yeah, it does. In my unbiased opinion, it's fantastic at doing this kind of work. Alyssa: I've heard. I've heard they're really good at that, yeah. Lukas: I guess, why do you think that these health records end up so hard to structure? Alyssa: Ask your wife. She knows way better than I do.
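As a small illustration of the medication-grouping idea Alyssa describes above, here is a minimal Python sketch that normalizes free-text drug names so that differently worded pharmacy records collapse into one logical entry. The records and rules are invented examples; a real system would need far more validation rules (Alyssa mentions writing hundreds) and likely proper NLP.

```python
import re
from collections import defaultdict

# Hypothetical prescription records as different pharmacies might report them
records = [
    {"name": "Atorvastatin 20 MG Oral Tablet", "date": "2021-01-03"},
    {"name": "atorvastatin 20mg tab",          "date": "2021-02-02"},
    {"name": "Lisinopril 10 mg tablet",        "date": "2021-02-10"},
]

def normalize(name):
    """Strip dosage and form words so records group by the base drug name."""
    name = name.lower()
    name = re.sub(r"\d+\s*(mg|mcg|ml)\b", " ", name)           # drop dosage
    name = re.sub(r"\b(oral|tablet|tab|capsule)\b", " ", name)  # drop form words
    return " ".join(name.split())

grouped = defaultdict(list)
for rec in records:
    grouped[normalize(rec["name"])].append(rec)

for drug, recs in grouped.items():
    print(drug, "->", len(recs), "records")
```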
From my limited understanding, I think it's because the healthcare system in the United States is just really, really fragmented and there's so many different entities in the data chain. I'll use a personal example. This week, I get headaches, and a doctor prescribed me a new medication. I have had headaches for a long time, so I've cycled through all the normal ones that someone would use, and this one is an expensive medication and it's out of the bounds of normal, and she prescribed it a week ago Monday to me. My pharmacy followed up with me that same day saying, ""Hey, we're working with your doctor and your insurance to get this covered, and we'll get it out to you."" A couple days go by, I still don't have my medication. I followed up, and they say they're working on it, blah, blah, blah. But the number of different entities that have to touch or approve this end to end from my doctor and me having a conversation and her prescribing it to me getting it is, I'm not joking, probably 10 different systems, right? It has to go from the electronic medical record that my doctor is using, and that has to go into an intermediary system that goes in between the doctor's office or the hospital and the insurance company, and so there's a third party in between that processes what's called prior authorizations. Then the insurance company has to ... We don't directly integrate with that particular third party, and so we have to do some data moving around in order to get it to the right person in our system to approve that. Then it has to go back to the doctor's office, but then there's this pharmacy over here that hasn't been involved in any of this so far, and there's a bunch of systems that they use in between. The short answer is there's a lot of different systems involved, and they don't all talk to each other very successfully, and the data gets manipulated and changed, and there's different standards and different data systems. Even though there are standards around healthcare data, I think July they go into effect in California in terms of being mandated to follow certain types of standards for certain narrow use cases, but there's just not a ton of structure for these different data types, and so they've evolved in different ways. Even the electronic medical record, we're dealing with this in the vaccine world, how does your doctor know that you've been given a vaccine? Well, that's kind of a challenge because let's say you got it at Walgreens. They may have taken your insurance and then maybe they submitted a claim to your insurance, maybe they didn't do that if you didn't have insurance, but they are not reporting it back to your doctor's system. Anyway, there's a lot of different software systems that are being used and there is not standards, whereas if you look at different countries that have more nationalized healthcare systems, there's one, two systems, and so there's just a lot less fragmentation, whereas California there's 8000 different providers and there's 10 major electronic medical record systems, three of which are really big, but there's a long tail for the rest of them. A place like Walgreens doesn't use electronic health records system or Safeway. They are a pharmacy. They use pharmacy systems, which are different than the hospital systems. So that's a long answer to your question, but it's the basic data. Lukas: Yeah. It's funny. I could see how years of working in ML would prepare you well for the American healthcare system. Alyssa: But it's basic data problems. 
It's not particularly sophisticated machine learning problems. It's data hygiene. Lukas: Right, right. Although it seems like that's the problem everywhere, right? Alyssa: Yeah, that's the problem everywhere. Yeah, exactly. Lukas: Were there other surprises going from, I guess, a start-up to an insurance company? How similar is your job doing product there? Alyssa: I think product management is similar no matter where you do it. It's always balancing stakeholders and priorities, and the day-to-day is certainly different in different types of companies, but I think fundamentally my skillsets are the same and the job I do is roughly the same. I think the problem space is really different and the excitement I get around it is really different. To answer your question earlier around how do people navigate into working on problems that they really want to work on and really love and how do you do that is follow where your interests are. I'm thrilled to get into the weeds of doing data munging, and I personally wrote, I think, 500 different data validation rules for prescription organizing, looking through hundreds of records of different types of prescriptions and how to organize some basic data hygiene rules. That was super fun, and I was thrilled to do it. It was painful work, and, certainly, I have other skills, but I was real excited about the problem that we were solving, which was launching a cogent experience for my friends and family who are members of Blue Shield around being able to look at their longitudinal patient record and not have all this messy duplication of data that they're showing, and so ... Sorry, I got a little off-track, but- Lukas: No, no. I totally, totally relate to what you're saying, and I think that's incredibly good advice. Alyssa: When I was at Appen and Figure Eight, some of the problems we worked on were super interesting and awesome and others weren't as close to my heart. We were optimizing advertising dollars or whatever else, and those are things that I just get less excited about, personally. Lukas: Totally. Well, I guess we always end with these questions, but it's funny because they're so relevant to your book, because what we want to talk about on this podcast is really just making ML work in the real world. But I want to ask them to you and get your take from all of the research that you've really done, and maybe get as specific as you can, but what do you think is an underrated aspect of machine learning that you think people should pay more attention to than they currently are? Alyssa: I think teamwork. We talked a little bit about this, but it's really teamwork. I think there's a misconception that machine learning work is pretty solitary and you can teach yourself to do it or you can do it by yourself on a laptop or whatever, but it's a team in order to deploy anything functional that matters, and it takes a lot of different skillsets. For the team to work together successfully, it's really around best practices of any team functioning successfully and has less to do with machine learning, but I think that often gets overlooked because there's a lot of focus on the technology and the right hard skills and the right technical systems. I think it's really easy to overlook the team dynamics of getting people to work together well, whether that's quality engineers or data folks or project managers or designers or scrum masters. You need a team of people who trust each other.
We certainly have plenty of those problems at Blue Shield or any team that I've ever worked on, where people don't necessarily trust each other and they may be critical of others' work or they may have communication challenges or whatever. Particularly remote, some of those things are harder to smooth over, but for successful machine learning teams that I've worked with, they have high trust, they have high collaboration and cooperation with a diverse group of people, and they welcome outside ideas and people who are willing to roll up their sleeves and get dirty. Lukas: If you think of ML practitioners that you've worked with, for someone that's listening to this, is there any resources that you'd point them to to become a better team member? Has there been a book that you've read or an article that's helped you with this? Alyssa: One of the books that was recommended to me that I really like around teamwork is called Turn the Ship Around, and it's a book that goes behind the scenes of a nuclear warship that was being deployed, and it was written by the captain of that ship. He came in, and he took over the ship, and it was a low-performing team, but, at the end of the day, it was a nuclear ship, and I'm going to totally botch all of the military stuff and get it completely wrong, but really important to do it well, can't screw it up. Lukas: Yeah, yeah, totally. Alyssa: The team hadn't been collaborating well, and he goes behind the scenes and talks about his time literally turning the ship around to get it ready for deployment, to go back out into doing whatever it's supposed to be doing, but it couldn't leave the harbor until it passed all its safety checks and the team was functioning better. They were working on this top-down approach and everyone covering their own butt and not necessarily really thinking critically about what they were being asked to do and how to do it better for the right outcomes. Anyway, I love this book, and I think it applies in business and all sorts of different settings, particularly machine learning, because it's high stakes often, what machine learning projects are being asked to do. The problems are big, and they're important, and they're worthy of solving, but they can also have pretty dangerous or negative consequences if they're not done well, and so this is a book with an analogy that I like to a nuclear warship because it's an important problem and it requires a huge team of people collaborating towards the right outcomes. Lukas: Wow, I love it. Oh, I'm going to read that book. Alyssa: I'll send it to you. Lukas: Awesome. The question that we always end with is ... And this is what you spent, I think, most of your career on, so I'm curious what you think is the biggest thing here. But we always ask what's the biggest challenge of making machine learning work in the real world or where there's specific pitfalls where you see machine learning projects fail. Alyssa: Yeah. We certainly talk a lot about this in our book. I think a few major areas are not having the right team, not having the right problem, not having the right data, and ... I don't know. I could go on. I think those three are probably the big ones. The data to me is often the long pole. Lukas: I guess, for more, read the book. Alyssa: For more, read the book, yeah. Lukas: We'll put a link to it, and, yeah, you should read it. Thank you so much. I really appreciate it. Alyssa: Thanks, Lukas. It's a pleasure to be here, as always, with you. 
Lukas: Thanks for listening to another episode of Gradient Dissent. Doing these interviews are a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think or starting a conversation, that would make me inspired to do more of these episodes, and, also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",8143 +Sean Taylor — Business Decision Problems,https://www.youtube.com/watch?v=ceCQh73dU98,2741,2021-05-13,"Sean: We focus so much effort on training models, getting features, on all our crazy architectures. The space of models that we can consider is increasing rapidly, but we still are bottlenecked on ""Is this model better than the one that we already had?"" Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Today, I'm talking with Sean Taylor, who's the Head of Rideshare Labs at Lyft (Note: At time of recording). Previously, he was a research scientist on Facebook's Core Data Science team, and before that, he got his PhD in Information Systems at NYU's Stern School of Business. He also has a BS in Economics from the University of Pennsylvania, and he tells me that he prefers R to Python, so I'm excited to get into that with him today. I guess where I wanted to start is the stuff you're working on now on ride sharing at Lyft. I mean, my first question is just, for people who haven't thought deeply about this, how does data science and ML factor into a ride sharing app that probably everyone has used? What are the pieces that matter, and what role does data science and ML play? Sean: Yeah, that's a great question. I think it's a pretty abstract concept because you just tell an app where you want to go and a driver shows up, and there's a lot of things that happen under the hood to enable that. I think of Lyft as a stack of algorithms that all add up to a driver arriving when and where you want. So, that driver showing up there is just a sequence of well-made decisions, and you can trace those decisions back as far as you want, all the way to when we acquired that driver and signed them up to drive for Lyft, and when we acquired the rider and got them to install the app and decide to use it. All those decisions added up to that match that we got in the marketplace. On the actual matching at the time of the ride request, I would think about it as, well, there's the map. We have to have a high quality map. On top of the map, we come up with ETA estimate. So, how long will it take a driver to get to a rider. That helps us perform a more efficient matching. Then there's a dispatch algorithm which actually performs the matching. There's a wide set of available drivers for some ride requests, so we have to decide which one is the best driver to send. Then also, we have to decide on a price. Pricing is a core algorithm for Lyft. On top of planned pricing, there's adaptive pricing. We have to respond to marketplace conditions to try to make sure the market stays liquid, so that's an algorithm that we have to run. Then I guess on top of that, we'll give drivers incentives, we give riders coupons, there's algorithms to decide how we disperse those. 
So, it's just a wide variety of little mini algorithms, all the way down to just, now we have say, we're predicting where you're headed, so that when you open up the app, maybe we can be intelligent about what shows up on the screen. It's a lot. I think a good experience is the conjunction of all those good decisions made, so if any one of them goes wrong, it can be a very bad experience. I think of the Lyft problem as more of quality control, in a way. The product itself is pretty exchangeable. We have competitors. It's pretty... you have other ways to get where you need to go. So really, it's all about making sure that those decisions are made really reliably. Every one of those decisions is powered by some estimate of some state of the world, right? So, the ETA estimate is probably the most tangible. How long is it going to take a driver to get to a specific spot on the map right now? But we have to estimate all kinds of other quantities of interests, like ""How will riders respond to higher or lower prices? How will they respond to higher or lower wait times?"" They're all combinations of machine learning and causal inference problems, in a way, because ultimately, at the end of the day, we're going to change something. We don't want to just train on some... it's not like a supervised learning problem. We actually want to say, what would happen if we did this differently? What would happen if we sent this other driver instead? And so, the problems are quite a bit more complex than just a standard predictive modeling set up. Lukas: I mean, how do you think about that, right? Changing a price is such an interesting thing. It doesn't fit... definitely, I agree, it doesn't fit neatly into a normal ML prediction. Do you have training data that you can run on, or how do you even model that? Sean: Yeah, that's a super interesting question where you have... one way to think about it for machine learning people that I like as a way to explain it is that there are features that are not under your control, and then there are features that are under your control, and you want to think about modeling them differently. It's important that the features under your control are subject to some randomization in order to be able to estimate a causal quantity of interest. If you want... if you really want to know what's going to happen when you raise prices, then you have to raise prices sometimes. Part of the problem with training models like that is you have to let the causal part of the model speak a little bit more than the features. There's going to be other things that predict conversion rate on a ride much better than price. Price is a powerful predictor, but if you don't randomize it, then there'll be other things that could explain the changing conversion rate that are correlated with price, like, say, ride distance. So, controlling for a rich set of things, having randomization of the variable is really important, but also there's a whole bunch of modeling architectures that we employ that help let the causal part of the model speak a little bit more. There's some really exciting work going on in, say... people call these heterogeneous treatment effects models. There's even neural network architectures for doing these kinds of things these days. But at the end of the day, you have to have been running some experiment in the background in order to make those models be able to tell you what's going to happen when you change the state of the world in some way. 
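To make the modeling setup Sean describes a bit more concrete, here is a minimal "T-learner" style sketch for heterogeneous treatment effects: assuming the price change (the treatment) was randomized, conversion is modeled separately for riders who saw the higher price and those who did not, and the difference in predicted probabilities is the estimated per-rider effect. The column names and model choice are illustrative assumptions, not Lyft's actual system.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def fit_t_learner(X, treated, converted):
    """X: features not under our control; treated: 1 if shown the (randomized) higher price;
    converted: 1 if the rider requested the ride."""
    model_t = GradientBoostingClassifier().fit(X[treated == 1], converted[treated == 1])
    model_c = GradientBoostingClassifier().fit(X[treated == 0], converted[treated == 0])
    return model_t, model_c

def estimated_effect(model_t, model_c, X_new):
    """Per-rider estimated change in conversion probability from the price change."""
    return model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
```

The key point from the conversation is the data requirement, not the architecture: without randomization of the treated variable, confounders like ride distance would soak up the effect and the difference above would not have a causal interpretation.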
Lukas: I mean, I would think price specifically, is obviously a sensitive topic for users, but also probably even way more for the driver. Do you think about other considerations there? Do you put constraints around yourself around setting price, outside of just modeling most efficient market or something like that? Sean: I think that one of the core problems for Lyft, and it's very pervasive, is ""What's your objective function for the business?"" You have to... at some point, you have all these algorithms that are all working together. What common goal are they working toward? At the end of the day, there's some kind of welfare being created by the system, and it's going to be allocated... some of the welfare is being allocated to the rider, some to the driver, and some to Lyft, which we'll take as profit. So we have to figure out where we're going to split those things, and there's trade-offs in splitting them different ways. If we just greedily took all the objective for ourself, we'd charge really high prices, pay the drivers almost nothing, and no one would use our platform. There's these short-term, long-term trade-offs. So, finding the right balance there is really important. One of the ways that we do that is we have a lot of guardrails in the system. We'll say, we would really prefer if certain things never exceeded some tolerances, and that's a way of us heuristically applying some guidelines that help the algorithm stay in a safe place. For driver earnings, for instance, we really like to increase driver earnings as much as we can. One way to do that is to just have people pay more. A better way to do it for everybody is to improve the efficiency of the system. So, if we can get drivers to have a passenger in the car more often, then they'd just make more money and the total surplus is greater for everybody. So, that should really be our goal. When we think about pricing, it's the zero sum game version of the thing. We would like to make the sum of the game larger for everybody, so we split a bigger pie. A lot of our algorithmic improvements that we think about are more on the efficiency side than they are on, ""Can we take more money from this person and give it to this person?"" Because that just... you run out of options there very quickly and you end up... somebody's unhappy. Lukas: Right. That makes sense. I guess, probably, a loss function that everyone can relate to is the ETA estimation, right? We've all been in a rush and had a car come late. You had a really nice post about this, and thinking about what the right loss function is, but I wonder if you could say how you think about what it means to have an accurate ETA function? Sean: Yeah. I think that that's a fascinating statistical topic. I mean, that post was about, there's a wide space of loss functions that all have some desirable properties of producing an unbiased estimate of ETA. You might even think about applying a bias estimator. Maybe I don't care about getting it accurate. I care about giving the user an upper bound or something like that, so you could think about some quantile loss, but ultimately, ETA predictions are inputs into some downstream algorithms. We've decomposed the optimization problem into pieces. The ETA estimates are a thing where we have to have a contract with the dispatch system, which is that our ETA estimates have some statistical properties. 
So, unbiased-ness is a really key piece there because we're going to run an optimization on top of those predicted values, and if we say, ""Hey, we're going to add a little bit of buffer on top so that the rider doesn't have a bad experience thinking that we underestimated"", that would be bad for the downstream optimization. So, the algorithm consumption of the estimates and the human consumption of the estimates are a little bit at odds on what would be desirable. So, I think we tend to prefer to get the statistical unbiased-ness right, and then figure out how to make the user experience better in a separate layer as much as possible. I think that historically, we played with displaying ranges of ETAs. A better answer to this question, it's not ""Estimate the thing differently"", but just be honest about the distribution of errors that you're likely to make in practice. Lukas: Sure. Sean: Yeah. Lukas: Well, tell me this. What loss function do you use? I mean, unbiased could mean different things depending on the context, right? Sean: Personally, I haven't worked on our ETA estimation problem. We have a really strong team of researchers there doing some really interesting stuff, but yeah, I haven't worked on it, so I don't know what we landed on. I know that we're at the point now where it's pretty hard to eke out gains in that algorithm. I think it's a thing where most of the effort is on just accuracy. One of the super interesting things about ETA is that not all accuracy is equal, so being correct about ETA in certain situations is more pivotal for your downstream optimization than others. You might think of that as label weights in some way. So, there are cases where getting the ETA right could really make the difference between getting the routing decision right or wrong in cases where you're basically going to do the same thing either way. Lukas: Could you give me an example of that? It's hard for him to picture what... I mean, of course, that's the situation for any algorithm, but what's the case where ETA is super crucial? Sean: Yeah. So, say that there are two drivers that we could potentially route to a rider. In cases where the estimates ended up being ordered the same, then the estimates aren't pivotal, right? So, there's a wide class of estimates that would rank them the same, and so always dispatch the same driver, but in markets where we have a lot of options and there's lots of drivers available, then you start to make mistakes, right? So, it's like a ranking problem, and if you invert the ranking, because the estimate was off in some cases. So, in thicker markets, we have opportunities to do better. We have opportunities also to do worse because we're getting the ordering of the drivers that's efficient to send wrong. Lukas: I see. Interesting. Sean: There's also a weird bias problem in the data that we have for ETA. We only observe the drivers that drive certain routes. So, they only drive to places that they've been routed to. So, estimated ETA for segments of the road that we don't observe drivers on, it's a set of missing data. That missingness is not at random. They might not be driving a certain place because we're not routing them somewhere, because we think the ETA estimate is really long, but it could now be short. So, there's a sense in which you'd prefer if you collected your data under a little bit of extra randomization or noise to get a better estimator. It's an interesting bias training set problem that I think is a little underrated. 
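A toy simulation of the selection problem Sean is pointing at: if training data only contains road segments the router already believes are fast, the fitted estimates never get corrected on the segments it avoids. Everything here — the segment model, the routing rule, the numbers — is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_segments = 1_000

# True average traversal time per segment (seconds) -- unknown to the system.
true_eta = rng.uniform(30, 300, size=n_segments)

# The system's current belief, which is wrong for some segments.
believed_eta = true_eta + rng.normal(0, 60, size=n_segments)

# Routing rule: drivers are only sent down segments the system believes are fast,
# so we only ever observe travel times for that subset (missing not at random).
routed = believed_eta < 150
observations = true_eta[routed] + rng.normal(0, 10, size=routed.sum())

# Training on the observed data alone looks fine on the routed segments...
print('mean abs error, routed segments  :',
      np.abs(observations - true_eta[routed]).mean().round(1))

# ...but segments we never route down keep their stale, biased beliefs,
# including genuinely fast segments the system wrongly thinks are slow.
unrouted = ~routed
print('mean abs error, unrouted segments:',
      np.abs(believed_eta[unrouted] - true_eta[unrouted]).mean().round(1))

# A little forced exploration -- occasionally routing down 'slow' segments --
# is what would let those stale beliefs get corrected.
```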
We haven't quite figured out what to do about that. Lukas: That does seem super tricky. I guess it's probably hard to run random experiments to collect more data. I think that might make people frustrated. Sean: Yes, that's right. It's very analogous, I mean, I used to work at Facebook...One of the things that you'd worry about is you're ranking a story really low in newsfeed, and no one ever sees it, so they don't engage with it. So, your training algorithm doesn't know that there's some features in there that could say, ""Hey, this is really good. We should be displaying this at the top."" So, you can end up in these feedback loops where some friends of yours might... you might not ever see their posts again, because they just aren't getting any eyeballs on their posts anymore. I don't know if that actually played out at Facebook, but it's a super similar problem, is that you have to acknowledge that your training data isn't some random sample of what you're looking for. Lukas: Right, right, right. I guess when I look at the ride share challenges that you mentioned, they seem like situations where you have pretty structured data coming in, and maybe lots and lots of data, and you have to deploy into a high volume production. It seems like a case where neural nets might struggle a little bit. Have you found that mostly neural nets work better than maybe older, I guess older is the wrong word, maybe less complicated algorithms? Sean: I would say... so, we do have a bias for simpler solutions. I think that's for good reasons of needing to keep things reliable, and historically, people at Lyft have gotten a lot of successful results with tree-based models. So, things like LightGBM and XGBoost are pretty popular techniques for supervised learning problems. I think that's for good reasons. I think trees do well with geospatial data. Latitude and longitude and time are things that trees can find good segmentations of. So, the features are naturally encoded very well. The representation is learned by tree very effectively, and so neural networks might provide a boost over that in the long run if you have a lot of data, but you have this thing that learns really quickly and doesn't overfit too much. So, it's an easy drop-in thing to use. I think that we're moving toward using neural networks, and in some cases, gradually. I think, yeah, we are trying to sort out some of these deployment challenges and making sure that they run reliably. Yeah, I think all the model quality control stuff is something you have to relearn a little bit as you move to a new modeling paradigm. Lukas: I guess you mentioned online, at one point, that your team uses entirely PyTorch. Is that right, and could you talk about the trade-offs there? Sean: So, part of it is historical. I worked at Facebook and I did a hack-a-month at FAIR. That was right when they were deploying PyTorch for the first time. I learned about it before TensorFlow, so it wasn't like I thought PyTorch was better than TensorFlow. Fast forward to last year, my team was working on...we're building a forecasting tool that has a plan built into the forecast, so we can change some policy variables and have the forecast reflect the change. So, we might say, ""Hey, we increased our coupon in volume and that's going to increase demand."" So, we'd like the forecast to reflect that, forecast with some causal effects baked in. 
If you can produce a forecast like that, one of the natural things that you'd like to do with it is actually run an optimization on top of it. So, you'd say, ""I will produce this forecast"" and then actually optimize the plan to make the forecast look as good as I would like it to look. If you're doing that, a really desirable property is that the model that you fit is the differentiable object, so that you can use basically... the same methods that you use for optimizing the fit of the model, you can use for optimizing the policy variables that you're plugging into the model. So, we really wanted to be able to produce a Python function that we had fit from data, but that was differentiable. So, having the model be done in something that was auto gradable was really important. I'm a big Stan fan and I like Bayesian modeling, but a lot of the Bayesian modeling tools don't naturally just produce this object that is differentiable. So, we're like, okay, well, we should work in some space where we have these auto grad tools available. It's been a bit of a trade-off. I think we're doing things that look a lot like Bayesian models, but on top of PyTorch. We're having to invent a lot of ways to do that ourselves, that would have been a lot easier if we did something PyMC or Stan. It's been a little bit of a challenge, but the upside has been a lot of modeling flexibility and also the ability to borrow from what all the neural network people are doing for improving the speed and reliability of fitting. So, there's a little bit of...it's fun to do things that look like neural networks, but are not. We're not using them to fit. There aren't any layers or pooling, or anything interesting going on. They're very similar. They're just the kind of models that you would fit in R, but we really needed this engineering requirement, that we would produce this model that had this nice property of being able to run optimizations and grading. Getting the gradients is a really beautiful thing at a place like Lyft, because we care about marginal effects of everything. So, if you want to know what the lifetime value of getting an additional rider is, which is a very common thing in business... What's your marginal benefit of getting one more person on the platform? With a differentiable model, it's very easy to do queries like that. We can just say, ""What's the gradient of the total lifetime value of Lyft?"", which is something we can estimate with the model. We can do the forecast, add up all the future revenue, discount it, and then actually just look at the gradient with that variable, with respect to every rider activation, and say what that is. So, PyTorch was a really natural fit for doing those kinds of queries. So, yeah, it's a little bit of, we got really low level to solve a problem and I think sometimes we regret being that low level. Lukas: That's so interesting. So, it wasn't PyTorch versus TensorFlow. It's PyTorch versus a Bayesian framework. It also sounds like you're using PyTorch essentially for data science, because you want the auto grad... or you want the gradients to be able to pull them out. I guess, where have been the pain points? Where has that felt frustrating compared to what you've done in the past? Sean: I think part of it is that we bet on... the optimizers that are used for neural networks are not particularly great for some of them. A lot of the models that we fit are pretty small fit into memory. We should be using some second order methods. 
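A minimal sketch of the pattern Sean describes: fit a small differentiable model in PyTorch, then reuse autograd not for fitting but to ask a marginal-effect question about a policy input such as coupon spend. The model form, variable names, and data are illustrative assumptions, not the actual Lyft forecasting system.

```python
import torch

torch.manual_seed(0)

# Toy weekly data: coupon spend (a policy variable we control) and observed demand.
coupon_spend = torch.linspace(0, 10, 40)
demand = 100 + 8 * torch.sqrt(coupon_spend + 1) + torch.randn(40) * 2

# A tiny differentiable demand model: demand ~ a + b * sqrt(spend + 1).
a = torch.tensor(50.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.Adam([a, b], lr=0.1)

for _ in range(2000):
    opt.zero_grad()
    pred = a + b * torch.sqrt(coupon_spend + 1)
    loss = ((pred - demand) ** 2).mean()
    loss.backward()
    opt.step()

# Policy query: the same autograd machinery now gives the marginal effect of
# an extra unit of coupon spend on forecast revenue at a planned spend level.
planned_spend = torch.tensor(4.0, requires_grad=True)
revenue_per_ride = 12.0
forecast_revenue = revenue_per_ride * (a + b * torch.sqrt(planned_spend + 1))
forecast_revenue.backward()
print('d(revenue)/d(coupon spend) at spend=4:', planned_spend.grad.item())
```

The same trick scales up: once the fitted forecast is a differentiable function of the plan, gradient-based optimization of the plan itself comes essentially for free.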
We've struggled a little bit with confirming that we're at a global optimum for the model. These are models that we should be able to confirm that. So, if we had done it in a more traditional model package, then we might've ended up with a more stable optimization procedure. I think the modeling flexibility that you get from PyTorch is partly... a cost that you pay is that everything's pretty low level unless you have these higher level abstractions. So, we had to build a lot of those abstractions ourselves. So, things like building spline basis expansion and building ways to... We actually have 40 or 50 models that compose together, and we had to build a way to compose a bunch of models so that they become one big graph of models. We had to build a lot of that stuff ourselves. We have a couple of people on that team that just got really interested in that part of the problem. I hope that one day we can open source the modeling architecture. The other super interesting pain point that caused us to develop something that I think was pretty interesting, was that everything in our system is a tensor. Tensors are really natural representation of marketplace data because it has a regular structure to it. So, you can say geography and time are two dimensions of the tensor, and you might add other dimensions, and that neatly encapsulates a lot of the kind of data that we capture. We ended up creating a labeled tensor implementation that we find it really useful to... It's a tidy data frame in R, but it's a tensor, and so we can use them as just variables in the system, and compose them and multiply them, and do operations on top of them. I later found out that there's this label... there were a bunch of these labeled tensor packages out there that do similar things. I think that that was something that we didn't realize we needed to build, but keeping track of all the dimensions of all the tensors that we were passing around became a first-class problem very quickly. Lukas: It all sounds like you want to use data frames, right? Sean: Yeah. They're data frames, except that they're dense, right? So, you can guarantee that you always have... for any- Lukas: Oh, I see. Sean: ... pair of coordinates, you always have a value. So, it's like a special class of data frames where you know some properties are true about them. Lukas: I guess this is a more open-ended question that I hadn't planned to ask, but I mean, since you've done a lot of Python and R, I'm curious how you compare the two, if you have one that feels more natural, that you like to live in, or ...? Sean: Yeah, I think this'll probably be pretty controversial, but I do everything in R, until I can't anymore, because I... Lukas: That is controversial. Interesting. Sean: I think that the Tidyverse people have figured out a lot of the interactive data analysis stuff. It's just much more first-class in R. One of the things that's an interesting consequence of R's syntax is that the lack of white space sensitivity and some of the ability to just use unbounded variables means that you just have a lot less typing to do similar things. I'll poke fun at Wes because I've had this conversation with him. I think the pandas API could use a little love, and if we could reinvent pandas from scratch and do Python data frames again, we'd probably do it a little differently and something with a little bit less surface area for developers to... Hadley is a designer, Hadley Wickham is the creator of dplyr and a lot of the tidyverse packages. 
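The labeled-tensor idea described above exists in open-source form; xarray is one widely used package, assumed here purely for illustration since the team's internal implementation isn't public. Dimensions carry names, so operations align on geography and time labels rather than on positional indices.

```python
import numpy as np
import xarray as xr

regions = ['sf', 'oak', 'sj']
hours = np.arange(24)

# Dense, marketplace-style tensors: every (region, hour) pair has a value.
rides = xr.DataArray(
    np.random.default_rng(2).poisson(50, size=(3, 24)),
    dims=['region', 'hour'],
    coords={'region': regions, 'hour': hours},
)
avg_fare = xr.DataArray(
    np.full((3, 24), 14.0),
    dims=['region', 'hour'],
    coords={'region': regions, 'hour': hours},
)

# Elementwise operations align by dimension name, not by position, so composing
# many tensors stays readable as the number of dimensions grows.
revenue = rides * avg_fare

print(revenue.sel(region='sf', hour=18).item())   # one cell, selected by label
print(revenue.sum(dim='hour').to_pandas())        # collapse a named dimension
```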
I think he thinks really deeply about these micro interactions that people have with the code. What are you actually do... what are you trying to accomplish? What's the minimum way to get there? Then also, is it going to stick in your brain? Are you going to remember to do it next time? So, I've just found that that fit my brain a little better, but all the production code that we write at Lyft is in Python, so I find myself porting some of my analysis in R over to Python quite commonly. Lukas: Can you give me an example of where data frames frustrate you? Or where pandas data frames are frustrating? Sean: Sure. So, one thing that is a little annoying is having to... Some of the operations will emit data frames, some of the operations will emit a series, depending on what kind of aggregation that you're doing. So, this is a functional programming no-no, right? dplyr is designed in the opposite way where there's very standard interface. Most of the functions take a data frame as the first input and always return a data frame, and that allows you to do this chaining thing. If you look up method chaining in panda, you'll find a couple of good articles on how to do it. It's a real stretch to do chaining in pandas, where you can apply a series of operations and read through them, and you can do this, but it just doesn't look as readable, and it requires a lot of clunkiness, but the .pipe operator in pandas is something that I use a lot when I'm using pandas, because I think it does what I like about dplyr. It just requires a lot of you to write your own code, to fill in some of the missing pieces. I think reshaping data frames from long to wide is just dramatically easier in R because that interface is a little bit simpler, like stack and unstack operations. In Ruby, they call it principle of least surprise. You should always... the API should return something that is unsurprising to you. Sometimes, I think some of the stuff in Python is most surprising. You're like, ""How did I get here with this object? I have no idea."" This is a long rant and a long complaint, but I think we can get there. There's plenty of great Python developers that are working on this, but I think that we made some design decisions early on that made it a little bit challenging to create these expressive interfaces. Lukas: Yeah. It's so funny. So, my experience was, I wrote code in mostly R for years, and I always found R a little baffling. When I switched to Python, I was so happy, and it made so much more sense to me. It's really interesting that you feel exactly the opposite. I wonder what's different about our brains or what we were trying to do. I think maybe functional languages are more natural for you, and I feel like all my smartest friends, that's the case. Maybe that's just going on. I mean, the thing for R that I always missed was I just felt like the plotting was so much more natural than Python. I feel like I still have to look up Python's plotting stuff. It's interesting that you don't even mention that as an issue. Sean: Yeah. I hate Matplotlib a lot. I would complain about that to anybody. I think Altair really solved that problem for me. I think- Lukas: Interesting. Sean: Jake VanderPlas wrote a really nice package. It's very ggplot-like in concept. In syntax, it's a little different, but I think it's a close map, so it's pretty easy. But I had the opposite. I was a Python developer since 2004 through grad school. I spent a long time in Python. I started learning R in grad school. 
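For readers who want to see the chaining style being contrasted here, a small pandas sketch (toy data, hypothetical column names) that uses method chaining and the .pipe operator to approximate a dplyr-style pipeline, where each step takes a data frame and returns a data frame.

```python
import pandas as pd

rides = pd.DataFrame({
    'region':  ['sf', 'sf', 'oak', 'oak', 'sj'],
    'fare':    [12.0, 18.5, 9.0, 22.0, 15.0],
    'minutes': [10, 21, 8, 25, 14],
})

def add_fare_per_minute(df: pd.DataFrame) -> pd.DataFrame:
    # A user-defined step slots into the chain via .pipe, dplyr-style:
    # take a data frame, return a data frame.
    return df.assign(fare_per_min=df['fare'] / df['minutes'])

summary = (
    rides
    .pipe(add_fare_per_minute)
    .query('fare_per_min > 0.8')
    .groupby('region', as_index=False)
    .agg(avg_fare=('fare', 'mean'), n=('fare', 'size'))
    .sort_values('avg_fare', ascending=False)
)
print(summary)
```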
It was my later language, but I like it more, so yeah. Maybe it is just you're... some people have a certain kind of brain that fits one thing or the other. Lukas: Well, cool. I also wanted to ask you about the Prophet project that you worked on at Facebook. Could you say a little about what that did and why you made it? Sean: Sure. Prophet is a time series forecasting package. It was built because we had some applications internally at Facebook that we didn't have good tools for. At the time, I was on the core data science team looking for interesting high-impact problems to work on. We had a couple people come to us, just with forecasting problems, I looked around...I was like, ""Forecasting can't be that hard"", and I started to Google around and look for what tools are available, and I really felt like the tooling landscape was a little primitive. In particular, there's one interesting aspect of business time series that's just difficult to model traditionally, which is this multi-period seasonality. So, you have a yearly cycle and data is super common, a weekly cycle is super common. You just end up with needing to think about carefully modeling these kinds of... they're just features that can be extracted from time, but they're not easy to do in an auto regression or exponential smoothing framework. So, I worked with... Ben Leetham, I have to give a great call out for, because I think he invented all the important stuff in Prophet. That project was going really poorly until Ben got involved and helped me solve a couple of really key problems there. Then what we figured out was that we just had this class of time series problems that are really common in practice. It's actually a really constrained modeling space. It's almost like an architecture for time series models. We just said, ""Hey, there's a small set of models that capture a lot of data that we see in practice"", and that prior over the models is a really useful thing to know, because it means... Time series data are always data constrained. You might have a year... you might have 300 observations, 400 observations. You're not talking about something you can learn a lot from the data. You have to bring a lot of priors to a time series problem. By coming up with reasonable priors for what that should be... and if you look at the Prophet code, it's got hard-coded parameters that are our priors over what we think is likely to happen in... It's not an elegant model in the sense of that, it's not super general. It's actually very specific, but that happens to work well in practice. Sometimes I just call it a bag of heuristics that we cobbled together, and I think real time series modelers probably get a little frustrated with us for having empirical success from something that's not as principled as the work that they've been doing, but people get a lot of value out of it. Part of it is just that they don't really want to learn about time series modeling that much. They'd prefer to just get it done and move on to another problem. So, Prophet provides a very easy way to get there. Lukas: I have a feeling of a lot of people listening to this might find this useful. Could you say what's the case where Prophet's going to do well and where it might not do well? Sean: Yeah. So, Prophet is built on a lot of local smoothness assumptions. So, if your time series jumps around a lot or is very random, or it has a non-human periodicity to it, then it's unlikely to work. 
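A minimal example of what using Prophet looks like in practice, on made-up daily data with the weekly and yearly seasonality typical of the human-generated series described above. The package and column names follow the open-source release (prophet, formerly fbprophet); the data here is synthetic.

```python
import numpy as np
import pandas as pd
from prophet import Prophet  # previously published as fbprophet

# Synthetic web-traffic-style series: trend + weekly + yearly seasonality + noise.
dates = pd.date_range('2018-01-01', periods=730, freq='D')
t = np.arange(len(dates))
y = (1000 + 0.5 * t
     + 80 * np.sin(2 * np.pi * dates.dayofweek / 7)
     + 120 * np.sin(2 * np.pi * dates.dayofyear / 365.25)
     + np.random.default_rng(3).normal(0, 20, len(dates)))

df = pd.DataFrame({'ds': dates, 'y': y})  # Prophet expects columns 'ds' and 'y'

m = Prophet()          # weekly and yearly seasonality are enabled by default
m.fit(df)

future = m.make_future_dataframe(periods=90)   # forecast 90 days ahead
forecast = m.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())
```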
It's really designed for these human-generated time series, human behavior-generated time series. So like web data where you're counting how many visits come to a website is bread and butter for Prophet, because it's highly seasonal. It has all these very predictable patterns to it, but those patterns need to be encoded in a way that allows the model to extrapolate them. When I see time series that come from more physical processes, really high-frequency stuff... stuff that jumps around, stuff with a lot of really abrupt changes in it, which violate this local smoothness idea... then you can see right away. My prior can be expressed as looking at a time series. When someone shows me a time series and they're like, ""Would Prophet work on this?"" I know right away if it will or not. A lot of it's just knowing what human-generated data looks like from having seen it a bunch of times. Lukas: So, you're essentially encoding, somehow, earth, human things, like week and month, and year. So, it's designed for more demand forecasting versus the position of Jupiter's moons. Is that fair? Sean: Yeah. I think that's right. I think when we first released Prophet, Andrew Gelman, on his blog, it was very flattering to get mentioned by him, he was like, ""I'll show you a time series that Prophet won't do well for,"" and it was some physical process. I forget what it was. I think it was lemur population or something like that. It was one of these physical processes, like population ecology, where it has a chaotic period to it, because it has a feedback loop built into it. So, the period is not regular, and it's like, well, if the periods are not regular, then there's no way a model that's trying to learn a regular period structure is ever going to fit that. So, I think we ended up having to admit that, ""Yeah, sorry, Andrew, you can't forecast lemur population using Prophet"", but I think that we're fine with that. It's an 80/20 thing. We'd like to capture the kinds of problems that we see in practice. Lukas: So, can you say a little bit about what you're doing under the hood with Prophet? Sean: Yeah. There's probably two or three tricks that I think add up to the whole thing. Probably the most important trick is just that we have these trend change points. The actual Prophet forecasting model can be really simple. If you strip out the seasonality, it's just a piecewise linear regression. Making a linear regression extrapolate well is challenging because you don't really always know how much of the historical time series to use to fit the slope at the last point. So, if you're trying to go into the future, you need to know the slope at that last point and where that's coming from. What we do is introduce this idea that the slope can change at various points in the past, and that we prefer those changes to be sparse. So, we're just using an L1 penalty in order to do that. That's a really standard trick in machine learning, and what that does is it comes up with a pretty, I would say, parsimonious representation of the trend of the time series, which is a sequence of lines that fit together, and the last line segment is the slope into the future. So, that actually works quite well. It's very similar to exponential smoothing procedures, which are getting the local slope that you're trying to use to extrapolate from the more recent data, rather than from the far past. It's just a sparse version of that, so that's one big trick. Lukas: But then how does that model periodic effects into the future? 
Or is that not part of its thing that it's trying to do? Sean: Oh yeah. So, the seasonality is just applied additively. At its core, Prophet is just a generalized additive model. So, very similar to... a lot of gam packages will fit all kinds of stuff that looks like Prophet. It's just that they're not really designed to extrapolate well. They fit, they interpolate well, because that's what gams... the loss function for gams is capturing that. For Prophet, we just had to make these modifications in order to get the extrapolation performance. And really, if you think about it, it's all about controlling the complexity of the model that you're fitting close to the boundary of the data, which is... because it's extrapolation, you really don't want it to get overfit at the last part where you're trying to go past it. In typical machine learning, we do way more interpolation than extrapolation, so we commonly don't think about controlling complexity at any particular point. We just want the best model, but in forecasting, you really prefer simple models when you're going off of the data that you've seen already. Lukas: Totally. I guess that's a good segue into one more topic I wanted to ask you about, which is the election forecasting. You've talked about, or thought about, election forecasting with using prediction markets, which is something that I think probably me and a lot of people listening to this have thought about. I guess I'm just curious. I mean, the question that's top of mind right now, and this is probably going to be out of date as soon as we release this, is we have FiveThirtyEight and all the election models showing a really high percent chance for Biden compared to the prediction markets and the betting markets. Do you have any thoughts on how those two things have diverged and why? Sean: Yeah. That's a really interesting question. I think the prediction market, people... Dave Rothschild at MSR was a really big believer in the prediction markets last cycle, and has since switched over to polling. I'd love to... I think he'd be a better person to tell you why prediction markets are failing to do this, but I think one part of it that I find interesting is that prediction markets...I think one reasonable use case for them is to do emotional hedging. You could say, ""Oh man, it would be the worst thing in the world if Trump won, so I'm going to go bet every cent that I have on him winning in a prediction market, so that if he wins, I'm just going to win a lot of money."" Not every prediction market participant is trying to maximize earnings. They can be hedging and it's a tool for hedging. So, you might think of, okay, so part of the difference in price could... it could be suppressed because of the... Lukas: But shouldn't some kind of... I mean, I'm out of my depth here a little bit, but isn't there some kind of efficient market hypothesis that someone would exploit the emotional hedging to make themselves a lot of money? Sean: Yeah, that's true. If the constitution of the market were... if you had an infinite population of traders, then yeah, I think you'd get there, but without perfect... without a lot of liquidity, if most of the people... all the market stuff depends on having a lot of people, and they're all... if a certain fraction of them were profit-motivated, then I think you're good. Part of it is also transaction costs. PredictIt, for instance, has a 20% fee for removing... 
for taking your money out, so it makes the incentives not quite the same as trading in a financial market. Yeah, I don't know. I think it's an interesting empirical puzzle because also, if you go to PredictIt and you look at the state level predictions, I think they align quite well with FiveThirtyEight, but the aggregate one, it doesn't. To me, that feels like the hedging explanation is my most is my favorite way to explain it, but I don't have a better explanation than that. Lukas: It sounds like, to me, then, you're siding with the poll aggregation versus prediction markets. Sean: Well, I am a big believer in polls. I think it's a really well understood technology that we've been deploying for a long time and there's a lot of great science behind it. I think you see Elliot Morris at The Economist working with Andrew Gelman and doing best-of-breed Bayesian modeling of the polls. At the end of the day, I think of this as there's some latent variable, which is intention to vote for one candidate or the other, that we're just getting noisy observations from. When you have a latent variable that you don't observe, you want to pool as much information that you have about that as you can, and you want to try to de-bias it as much as you can. We've gotten quite good at that. I think that the real epistemological problem here is whether polls mean what we hope them to mean. I think it might just be that people answer polls differently now, or think about them differently. This was the Shy Trump Voter hypothesis from 2016, is maybe people legitimately aren't telling you how they're really going to vote. In a world where that breaks down, I think polls become a lot less credible as a source of information. So, I think we always have to take on faith that people are answering these things in accordance with their beliefs, at least most of the time. I hope that that will sustain itself because I can't even really imagine a world four years from now, or eight years from now where we actually don't have any credible estimates of these things. We've gotten used to feeling some level of certainty about where the election stands. Lukas: Well, I guess, what role then... if you believe the polls, I guess what role would prediction markets play, or could they play, in an election forecasting? Sean: Certainly, the polls are informing the participants of the prediction markets, right? I can't imagine that they're coming up with their beliefs... People in the prediction markets have some subjective belief about what's going to happen, and that's informed by some information about the world. Whether that's just them talking to their friends or reading the news or whatever, or actually just analyzing data, I think at the limit, if you really want to do well in a prediction market, you would want to bring as much information as you could to bear on the problem. But also, I guess, this comes up a lot where it's like maybe the people who analyze the data the most are not as willing to participate in the prediction markets. People are always calling on Nate Silver to make large bets about what he's estimated, and he seems a little bit reticent about that. So, I guess, yeah, there is this interesting question of maybe the polls aren't driving the prediction markets as much as much as you think. To be honest with you, I don't really know what's motivating a lot of the people participating in the prediction markets, and whether they're really acting in a profit-motivated way, or they're just gaming. 
You can think about fantasy football players are doing a similar thing. They're moving some things around on the internet and hoping that they won a little bit of money as a result of it, but they might not be thinking too deeply about it. I'd love to see some research on just actually talking to those people about what their process is and what they're doing. If you go to a website like Metaculus, which I'm a big fan of... it's not a prediction market, but a prediction aggregator... you see a really nice community of people that actually talk about how they end up with the forecast that they came up with. I think that you get a lot of insight from that, like what are they actually doing in practice to figure out the future state of the world? It does look a little bit like this foxes versus hedgehogs things. They just cobble together little bits of information and make more directional changes. Lukas: Yeah. I mean, I guess you can imagine... I mean, Nate Silver is so spectacularly good at articulating what he's doing, but you can imagine somebody who's really good at forecasting, but maybe not as compelling of a writer or as clear of a thinker, doing really well in a prediction market, but not having a famous website. So, it does seem like that could provide them room to shine. Sean: I was a big believer in it and I think I'm just starting to have doubts now. I mean, I built a prediction market a few years ago because I thought that there were a lot of Nate Silver types out there doing this kind of stuff. I guess I just didn't end up...it's really hard to get people to participate in prediction markets. I think this is an underrated aspect of it is. I built one. I tried to get people to use it. It's cognitively costly to create predictions, and especially ones where you're going to have some skin in the game, you're going to even incur more. So, it's not free to get participation in a prediction market. They're doing computation in the background that's expensive to produce their predictions. I think this is an underrated part of the problem, is that in financial markets, we just assume that the incentives to participate far outweigh the cost to the participants, but in prediction markets, I think that the problems that they're solving are cognitively expensive, and the payoffs are a little bit smaller. So, we might be in a world where we get under-participation, so you don't end up with these great stories about markets being amazing aggregators of all available information. Lukas: Totally. Well, we always end with two questions and I want to give you some space to answer these questions. So, our second last question is, what's an underrated aspect of machine learning or data science that you think people should pay more attention to? Sean: Yeah. That one... I always have strong opinions about... and to me, it's very obviously model comparison and evaluation. We focus so much effort on training models, getting features, on all our crazy architectures. The space of models that we can consider is increasing rapidly, but we still are bottlenecked on ""Is this model better than the one that we already had?"" I think that that's a nuanced problem. It's usually a lot of criteria that go into that, and coming up with good model evaluation procedures is hard. It's not just AUC. It's not precision recall curves. That's a part of the problem, but there's just so much more to model comparison, like cost of the model, upkeep, decay, stability, interpretability. 
I mean, there's just this wide array of things that we'd like about models that we're not really encoding. I just feel like it's always the thing that people, when I'm talking to them, have thought the least about, but it's the part that I'm most interested in. So, that's my very clear answer to that one. Lukas: Interesting. Is there any work that you could point people to if they want to learn more about that? Sean: I mean, I think that the posterior predictive checks stuff in the Bayesian community is getting in the right direction. It's sort of a general approach to inspecting the predictions that a model makes. You see actually in the... Elliot Morrison and Andrew Gelman doing this with their election probability model. They're looking at predictions and trying to see, ""Does this make sense to me, and where can we make improvements?"" So, I think that that's a really fruitful place to look to. I guess the other literature that I point people to is off policy evaluation. Usually, if you have a model, you're going to go and make decisions with it, at some point. Those decisions will add up to some value in some way. The most faithful representation of how good the model is, is if you actually plugged it into your production system and ran an online test, how well would it do? So, off policy evaluation is just an offline way to try to estimate what would happen online if you ran your model in production. It's a hard approximation to make, but if you can do it, then you can be much more sure that your model is the right one for the task that you're going to deploy it for. Lukas: Interesting. So my final question is, what's the biggest practical challenge of making machine learning models useful in the real world? I would say for you, at Lyft, what do you see as the biggest bottleneck to taking a model from research or conception, to use in production? Sean: Good question. I think there's still a lot of really base needs that need to be met. I think getting and collect... getting training data into a shape that the model can be trained on it, I think is still something that... We spend a lot of time just making datasets for consumption of models, and I think that that's something that's still a little bit slow. There's some technology that's helping there, like feature store type ideas. I think that that's a challenge. I think that this, just, model life cycle stuff is still a big thing. I think two people collaborating on a model is a pretty challenging thing these days. I think you see... if one person gets to work alone, they can move much more quickly than they do in a group, but getting a group's worth of effort on a model is a really useful thing. So, I think that decomposing the problem into something that multiple people can work on is a big opportunity. Finally, I think that the monitoring and making sure that things are behaving the way that you'd like in production, that trust when it's running in production. And for us at Lyft, it's like if we screw this up, then the marketplace falls apart and drivers don't make money, and riders don't get rides. It's a really big downside risk to losing reliability. So, getting to the point where we trust the decisions and that we can... So, we end up spending a lot of time just making sure that we're confident that the models are going to do something reasonable in the real world and a lot of layers of testing in between. 
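A sketch of the simplest version of the off-policy evaluation idea mentioned above: inverse propensity scoring, which reweights logged outcomes by how much more (or less) often a new policy would have taken the logged action. The policies, reward process, and numbers are toy assumptions; real systems typically layer variance-reduction techniques on top of this basic estimator.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
actions = np.array([0, 1])          # e.g. two candidate dispatch choices

# Logged data from the CURRENT policy: it picks action 1 with probability 0.3.
logging_probs = np.array([0.7, 0.3])
logged_action = rng.choice(actions, size=n, p=logging_probs)

# Hypothetical true reward process (unknown to us): action 1 is actually better.
reward = rng.binomial(1, np.where(logged_action == 1, 0.6, 0.5))

# NEW policy we want to evaluate offline: it would pick action 1 with probability 0.8.
new_probs = np.array([0.2, 0.8])

# Inverse propensity scoring: reweight each logged reward by the ratio of the new
# policy's probability of the logged action to the logging policy's probability.
weights = new_probs[logged_action] / logging_probs[logged_action]
ips_estimate = np.mean(weights * reward)

print('estimated online reward of new policy:', round(ips_estimate, 3))
print('(true value would be 0.2*0.5 + 0.8*0.6 =', 0.2 * 0.5 + 0.8 * 0.6, ')')
```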
I think that in the future, I would hope that we can get to a point where that friction starts to go down and we can be a little bit more iterative. Lukas: Awesome. Well, great sentiment to end on. I really appreciate your time. Sean: Yeah Lukas. Thanks for all the great questions. This was super fun. Lukas: Doing these interviews are a lot of fun, and the thing that I really want from these interviews is more people get to listen to them, and the easy way to get more people to listen to them is to give us a review that other people can see. So, if you enjoyed this and you want to help us out a little bit, I would absolutely love it if you gave us a review. Thanks.",8820 +Polly Fordyce — Microfluidic Platforms and Machine Learning,https://www.youtube.com/watch?v=IMS7fNEsyyA,2755,2021-04-29,"... we make these devices called microfluidic devices, that are kind of you can sort of picture the way integrated circuits made it possible to do a lot of electronic computations, and have very small footprint, and that kind of led to this revolution in computer science hardware. We make these microfluidic devices that allow us to do fluidic computations in high throughput in very small footprints. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. Polly is Assistant Professor of Genetics and Bioengineering at Stanford. Her lab's main focus is on developing and applying new microfluidic platforms to create high throughput data, which is crucial to making machine learning work in biology and genetics. I'm super excited to talk to her today. Thank you so much for agreeing to this interview, people have been asking us to get more kind of content on the intersection of biology and machine learning. And it's kind of funny, I'll just say, you told me that you didn't know anything about machine learning, but as we've kind of gone around, we've realized that you're well-respected as someone in biology that there's a lot about machine learning. I don't know if I can trust your self-assessment here, but- That's really nice to hear. I feel like we don't very much about machine learning, but we have been collaborating more and more with experts in machine learning. We're trying to learn as we go. Well, it's funny that I've discovered with our pharma customers and are getting a lot of those lately, I started to realize dropping your name actually gives me like a ton of street credits of vendor. That's so awesome, that's great to hear. I guess, I should say that I feel like I was friends with you from undergrads. It's a little funny, I mean, it's awesome to watch your career trajectory, and it's exciting to talk to you about your work. It's still on my part, right? If I tell any of my students that I know you instantly, it's like I'm a Silicon Valley celebrity, right? I'm like at least in close proximity to it, so it goes both ways. Nice. All right. Well, maybe you could explain kind of at a high level of what your research interests are. You kind of laid it out in the notes, and I tried to do some background research like reading your papers, like I normally do with the machine learning guests, but I found your academic record very impenetrable. So I kind of take a big step back with me and sort of explain what you're doing and why it's important. It's really technical. 
I guess I would say, a couple of examples of the things that I'm interested in are, the promise of the human genome project a long time ago was this idea that we were going to be able to sequence everybody's genomes. And then we would look at the difference in the sequences of those genomes. And we would instantly be able to say whether a particular mutation in the genome meant that somebody was going to have a particular disease, or maybe they would respond to a particular treatment. And I think the challenge is that the amount of possible variation is really huge. There are so many different variants that we discover, and we still don't really know for the vast majority of variants. Three quarters of variants that we found we have no idea whether they're likely to have a functional effect. That's right, I'm going to start the dumb questions early. You mean variants of genes, different DNA, is that right? Yeah, I mean different letters in the genome, different letters in the genome, right? Different letters in the DNA, okay, got it. Different letters in the DNA. And so probably the main thing that my lab is really interested in is trying to figure out maybe from high school biology, everybody kind of remembers that DNA makes RNA makes protein. And we're pretty good for portions of the genome, the parts of the genome that say what proteins to make. We have a pretty good sense of what RNA is are made, and what proteins they may kind of. But then what we really don't know is how to predict what those proteins do from the sequence, right? So it's like we have parts of the program, but we just don't really know how to predict what the functional effects are going to be when we make changes. Right. I mean, is it kind of deterministic, don't you actually know from the DNA, what RNA might make? Or I guess, in biology anywhere you pull out a thread it's more complicated than you think, right? Yeah, it's pretty interesting. And that we have a sense of, I guess, I would say so we know for the parts of the genome that actually code for proteins, which is a tiny amount of the genome, a really small fraction. We have sense of what RNA is are made, but there's way more regulation after that. First just for the RNA, the RNA kind of will loop around, and cut itself up to make kind of different variants. And then when we make proteins from that, I think one of the big challenges is figuring out a protein is a linear sequence that has to fold into a three-dimensional structure, and that three-dimensional structure does something. And I think a great example of where machine learning has had a real impact in biology is AlphaFold too, is a great example where there's been this problem for a long time, what three-dimensional structure to linear protein sequences make. And here machine learning algorithms have improved our ability to predict that, but we still don't know what those proteins do when they're folded, or whether they just fold into one confirmation or multiple confirmation. I think there's a lot more questions like that. Could you give me an example of one that you do know? Because we know some, right? There's some mechanisms that we understand, right? Yeah, there's some like in terms of protein folding or in terms of- Just in terms of the whole sequence. What's the sort of canonical example from high school biology, you have some different letters, so then you're missing a protein and then you have some disease, right? That's sort of kind of my mental model is that even right? 
Yeah, there are like a small number of, I guess, initially it's like Mendel with the peas, right? You learn about Mendel with a piece in high school. And it's like, ""Oh, depending with the sequences, it's either going to be pink flowers or white flowers."" And I think people thought that was going to be the case for genes. And there are a small number of genes like sickle cell anemia is a great example of a gene where we know that this gene, if you have this variant, you're going to have sickle cell anemia. If you don't you won't. But most traits whether it's height, or autism, or diabetes, or whatever are actually, it's sort of like there's a whole collection of thousands of genes that determine whether or not you're going to get a particular disease and how you have a distribution of genes that mean you're more or less likely to have a disease. And then that distribution interacts with your environment and what you're exposed to. It's more complicated than Mendel made it seem. Your research is on the actual kind of physical mechanism that goes from you have more of some kind of protein and then something happens. Is that right? I guess my research and again, it's like so technical. There's a few different things that I would say my research focuses on. At a basic level, one of the questions that I'm interested in is when you have changes in the sequence of a protein, or changes in the part of the genome that tell you when and how much to make of that protein, how do those changes alter or function? I guess, initially I was a physicist, my PhD is in physics. And one of the things that I think is really interesting is that these sequences code for molecules, three-dimensional molecules, and a change in the sequence of that molecule changes the physical forces that uses to interact with other molecules. That can affect whether a cell lives or dies, whether a fetus lives or dies. It's sort of this interaction at the scale where a changes the level of a molecule can have profound influences for a fetus or a cell. My lab is really interested in how changes in the sequence of a molecule affect its structure and function. I'm not sure if that's like specific- No, totally, let me see if I can repeat this back. It sounds like you're interested in... The DNA makes RNA and there's probably some asterisks there and then the RNA kind of makes a linear sequence of a protein. And it sounds you're sort of interested in like how the changes in the composition, I guess, of that protein sort of change something that happens beyond that. DNA makes RNA makes protein, proteins then fold into a three-dimensional structure, and they do things in the cell. Sometimes they bind RNA to tell the cell when it should make other genes, they bind other proteins to transmit signals. Proteins are kind of the functional workhorse of what makes stuff happen in yourselves. And my lab is really interested in how do changes in the sequences alter the structure and function of the molecules. And I guess, I would say sort of two more things. One of the things, our approach is it's a problem of staggering complexity, right? The number of possible amino acid combinations for an average size protein is larger than the number of atoms in the universe. So we're never going to be able to test all possible variants and see what they do. That's just impossible. So we're really interested in trying to figure out, can we create libraries in which we systematically vary sequence? 
It varies these physical properties, and we assess the effect on function, so that we can kind of learn not just a black box relationship between sequence and function, but we can ultimately develop quantitative and predictive models that would allow us to predict not just for the molecules we study, but for all molecules, how sequence changes alter function. I see. But through kind of a physical understanding versus I feel like that the machine learning perspective might be to sort of like, ""Hey, let's treat this as a black box, possibly in the sort of look for patterns here versus trying to understand the actual physics of what's happening."" Exactly. Interesting. Where we've really loved collaborating with machine learning specialists is, our approach is we develop these tools, we make these devices called microfluidic devices that are kind of you can sort of picture the way integrated circuits made it possible to do a lot of electronic computations and have very small footprint. And that kind of led to this revolution in computer science hardware. For us what we do is we make these microfluidic devices that allow us to do fluidic computations in high throughput in very small footprints. Now what we can do is, we can do- Fluidic computation? Right. Normally if you were going to an experiment itself in biology, you sort of picture test tubes, and Petri dishes, and big things. And if you wanted to do a thousand reactions, you need these giant expensive robots. And so what we've been doing is we've been using this approach where we can create these tiny devices that instead of using five milliliters of fluid for each reaction, we use about a nanoliter, and these devices make it possible to use fewer reagents. So everything is low cost. We can automate things on these devices without the use of expensive robots. And now the main power of these technologies that they allow us to make a thousand measurements in the amount of time, and cost that it used to take to make one in biology. And now I think that that means that we can generate data at a scale that allows us to quantitatively test predictions from our colleagues in ML, right? So you all need ground truth. You need some ground truth measurement to assess what's going on. And you can't just have one or two, you need enough that you can do some sort of regression to figure out where is your model successful, and where is it failing? And so our job is to make measurements of a thousand things really quantitatively, where we can interface back and forth with ML people to test those predictions, revise, and refine those models. And hopefully try and use some of these ML predictions to learn your physics. That's what we want to do. That's so cool. What would be something that would happen at that tiny scale? Are you literally putting a protein in there and watching what happen. I mean, can explain an example into that? Exactly, here's two examples of some platforms we've developed. We've been working really closely with Dan Hirschlag, and he is like an entomologist. And so one type of protein that we're interested in is enzymes, and enzymes they underpin all of our metabolism, right? They make it possible to do chemical reactions that would never happen in the absence of an enzyme. They're important, both for ourselves, they're the tools people use in modern molecular biology, you use them to make libraries for sequencing. You use them when you do your laundry, right? And some sort of things that bust up stains on your clothes. 
And we still don't really know how the sequence of an enzyme specifies its function. One thing that we can do now is just like the Moderna vaccine, right? Everybody's sort of heard now we can make this mRNA vaccine, and we can program it to make something that we want. We can create little pieces of DNA, each of which specifies a protein we want to make. We can use a robot so that we spot bits of this DNA in an array. So we have like a thousand little spots, and we know the program encoded by the DNA in each spot. We can take one of these devices that we make, that has little chambers, and align them to the spots. And then there's sort of this magical mixture of all of the stuff that you need to turn DNA into RNA and protein, the company sell. It's like you just buy this little tube that has the polymerase you learned about in high school biology, the ribosome that makes the protein, all that stuff. We'd push it into these little chamber- That's Nano leader. Nano leader is all fits. A Nano leader is like your hair, a hair strand is like 100 microns. Each of the chambers in these devices is about the diameter of your hair and the height of a 10th of your hair, right? We use like a lot of the machinery that people use for lithography to make least integrated circuits. We use all the same equipment to make these tiny devices. And now we can make a little- I can say I see the integrated circuit analogy. Yeah, exactly. We really do use a lot of the same equipment, except for now, instead of pushing electrons around, we're actually pushing fluid that contains molecules in different ways within these devices. We can make each one of these enzyme variants in each chamber. And now we can quantitatively ask when you make this mutation, how does it affect the ability of this enzyme to catalyze the reaction it's supposed to catalyze? That's an example of one of the things that we do, and the reason why you would want to do it is, this might help us classify variants in the human population for whether or not they're likely to compromise function and cause disease. It could also maybe help us generate new enzymes that eat up environmental waste, or design new enzymes to do things that we want to do. One other example, I guess, of something that we do is, historically when you've looked at a population of cells, let's say, from a tumor. We've ground up all those cells, and we've asked what's the behavior of that population of cells. Within all of those cells, maybe there's one or two rare cells that's resistant to a drug. And when we treat a patient with that drug, those one or two cells are going to proliferate and drive treatment failures, right? We need a way where instead of looking at all of the cells mashed up together, we want to be able to profile the cells one by one. Another technology that we're using that this field microfluidics allows you to do is we can actually put every cell in a tiny droplet, like basically a little water in oil, a droplet that serves as a tiny compartment where we can interrogate that cell by itself without looking at all of its neighbors at the same time. And so, again, those droplets are like a Nano leader, right? And we can look at a million cells individually at once in their own little nanoliter compartments. How do you break up all the cells? Some cells just grow like blood cells grow by themselves, for solid cells, this is something that actually our collaborators do. 
I never actually really know how to do this, but you can treat them with enzymes that chew up the stuff that connect them so that they separate, right? If they grow on a surface, you treat them with this enzyme, and then they separate from each other and come into the solution. And then we put them in the bubbles, in the droplets. In some automated way, I assume? Yeah, I wish I could show you the videos, I could send you- I know. Send me some videos, we'll put some links to them. That'd be awesome. I'll send you videos of both. Cool. I mean, I guess it's funny, a really dumb question that I keep being kind of afraid to ask, but I think other people might be feeling, it's like everyone's sort of saw in machine learning the protein folding thing. And kind of everybody knows that protein folding is this interesting big problem that a lot of ML people have worked on, but I've always kind of, I guess, I'll ask the question, why is protein folding so important? It seems like it would be really critical to your work. But can't you also just look at the proteins, and see what shape they have. Are they literally just that? It's such a good question. These questions are awesome. Yes, they're tiny, right? They're really tiny. And so to see the structure of a protein, you have a few options. Historically people have tried to crystallize them. They've tried to get them to basically form a three-dimensional crystal where they're all in the same shape. And then they've taken them to a giant x-ray beam, right? Like the Stanford Linear Accelerator or other places like this. They've shot x-rays through them, they've looked at the diffraction pattern that they make. And then they apply a bunch of kind of super fancy fourier transforms essentially to take the diffraction pattern, and turn it back into a picture of what the protein looks like in 3D. It's really hard, right? You go to talks all the time where a graduate student is like, ""I spent five years trying to crystallize this one protein, right?"" A lot of proteins don't crystallize, it's slow. And the other thing is most proteins don't exist as a single static structure, they're wiggling around all the time. And that wiggling is really important for their job, for how they do their function. More recently, people have started using cryo electron microscopy is another way to kind of look at proteins, where you like freeze proteins down on these metal grids. And then you use the super fancy, like $10 million microscopes to look at the individual particles. And there's been a real revolution in this in the last several years, basically because of that image processing algorithms have made it possible to align many different particles and kind of reconstruct what things look like. But that's only suitable for big proteins. You can't really do it for small proteins. The vast majority of proteins don't have crystal structures, or are these cryo-EM structures, so we just don't know what they look like. And we've really looked at some of them fold into these three-dimensional structures. A lot of them are kind of unfolded and we have very few pictures of what they're doing. So trying to predict the number of structures we have is just tiny compared to the number of proteins we know about, and the structures are off in a static picture. That's one reason why it's a hard problem, and the reason why you want to know is let's say, you want to design a new drug to target a protein. 
You kind of need to know that 3D shape, so you can figure out where would you put a drug, and what kind of drug is likely to fit in there, and alter the function of that protein, maybe. I'm not sure if that makes sense. Yeah, that makes sense. No, that was really helpful thank you. And I saw this amazing blog posts that I think was from more of a computer science perspective on how the Moderna drug works, which is super helpful for me to understand why you would kind of care about. I dunno that's my mental model now. I think Drew forwarded that to me. Oh, cool. I was like, ""Wow, this is so amazing that people could figure this stuff out and then make a certain shape."" And then it seemed like they modified it a little bit from the natural one to kind of make the shape better. And I can't believe they figured it out, but it sounds like they figured it out in days. What was really interesting was yeah, Drew was like, ""The people who figured this out should get some huge prize."" And I think what's really interesting is that, it's been like tens of thousands of people over decades who have made it possible, right? I think for this particular vaccine, there are people that sort of specialized in mRNA vaccine production, that'd be critical. There are people that specialize in Coronavirus in general, and spike protein, which is the protein on the surface that we're trying to mimic with these vaccines. But it's really kind of a beautiful example of so many different fields of biology have contributed to that, in terms of thinking about the folding and the structure of RNA to figure out, I mean, both in terms of immunology, what parts of the protein should we be targeting, in terms of thinking about nucleic acid biology, how do we make an mRNA that's going to be pretty stable, right? What are some of the modifications that you're talking about made it more stable? Thinking about delivery, how do we wrap it so that it can go into your body and isn't just instantly chewed up by all of the enzymes in your body that are looking for foreign invaders and why don't you have them up all the time? It's an amazing triumph of the scientific community, and scientists from so many different fields. It's really exciting, I guess. Yeah, it seems cool. I guess, I'm kind of curious your experience collaborating with machine learning practitioners. Can you maybe describe what that's been like, and what... I mean, I remember when I first started working with people in medicine with my last company, it was such a funny kind of cultural mismatch. I remember them telling me, they were doing microscopy and they were like, ""We have so much data, we have like 500 people's tumors that have been sliced and stay into something."" And I was just like, ""Wait a minute, I'm not sure any of my methods would work with that."" They're big file I guess, but I think I need more than a big file. I mean, I love it, I think it's so pleasurable. I love working with practitioners of machine learning, both because as a field it's moving so fast, right? The things that are possible this year are different than what was possible six months ago, a year ago. It's interesting to think about all of the ways in which algorithms that are developed for figuring out whose face is in a photo can instantly be ported to biology, right? So you can leverage all of the commercial interest in developing something like that towards problems like what we study that are never going to be as commercially viable, or interesting, right? 
So that's really exciting. In terms of the culture mismatch, what's funny, I think is, I'm on thesis committees for a lot of ML students now. And for ML students, what they want is they want their algorithm to have the best AUC by 2... Even at small, an incremental benefit is good, right? Because it could potentially scale. But for them any points that are unexplained are like a failure. Whereas for us, that's the most interesting part, right? What do those points that are not explained by the algorithm having common and are we discovering new biology or new physics that we hadn't thought about before? And it's cool in that trying the mathematical facility that ML practitioners have is astounding. And it's fun where some of the questions, people are like, ""I'm sorry, just what is a protein?"" Where we're like, ""Oh, okay, we can answer that."" And then at the same time, I'm looking at that image everybody shows of their neural net, with all the layers, and I have no idea how you would actually implement that. I've seen the picture, I have the picture in my papers, right? But I would never be able to actually do, I don't even know the first thing about how to set it up. I think that's what's sort of fun about it, is that there's this natural complementarity, but there's so much for each side to learn that it's always really intellectually engaging. Do you feel like coming from kind of a physics background, is it maybe disappointing that... I mean, do you worry at all that maybe the only way to explain some of these systems is through kind of a black box technique? I feel like the protein folding thing, it seems like for a long time, I knew that people were, it seemed like they were really trying to just simulate what would happen to the proteins. And I'm not totally up on the latest stuff, but it seemed to me like the approach that worked really well with the AlphaFold was sort of less physics simulation, and more just kind of like observing, I guess, where do you think that goes? For me, what I really think about is, and this is sort of the heart of some of the stuff that we've been doing with [inaudible 00:29:14] lab is, here's my motivation for why I think we need to eventually know the physical principles. Let's say we were wanting to learn how to create a new ballistic, or fly a new thing. If we just wanted to take this black box approach, it would be like each time we want to fly a new thing or make a new ballistic, we're just going to make a thousand ballistics. And then we're going to shoot them, and we're going to collect the data, and then we'll train and neural will hold out some of the data, and we'll train a neural net. And now we're going to be able to predict it for that system. The fact that we know the laws of gravity means that we're not restricted to just now working with that system that we've tested a thousand times, we can work with all kinds of systems, because we have this generalizable physical model. I don't think it's necessarily at odds with machine learning approaches. One thing that we think is really exciting is, let's say, you're able to train a neural net on a given dataset, where you have a physical hypothesis about what's going on. We can do a lot of experiments, we can do a thousand experiments, a thousand measurements in parallel each time we run an assay, but that's often not enough to fully characterize the system. 
If you have a neural net that can predict behaviors, now we can feed it in things where we're systematically varying particular physical parameters to ask what it thinks, right? I think sometimes you can use these black box models as a way to do in silico experiments, at a scale far beyond what you could reach with even the highest throughput- In silico. Right? Thank you, come on. In silico experiments, you can do a millions, or billions in silico experiments. Choose a thousand that you then go back and test with some of our experimental techniques to see what's going on. I think rather than just thinking of neural network predictions as an endpoint, like I'm going to train on the system, and I'm going to predict for the system, can we use them as a tool to uncover generalizable physical principles? To me that's like a really interesting and complimentary way to think about those problems. That makes sense. Another question, and I guess, I'm just asking kind of the dumb questions that maybe I'm afraid to ask other people. But when I look at image recognition, and I feel like I've been working on image recognition for two decades. I've sort of seen it go from totally not working to working maybe better than humans in a lot of controlled cases. Particularly in like clinical cases, right? There's a lot of clinical evidence that it can work better than- Yeah, and you talked about how it's mostly trained off of images online, like really ImageNet was kind of this moment where it started working, where people decided to collect a huge set of data. And then there was this thing called, Transfer Learning, where that's kind of become mainstream were people take something trained on a big set of data, and then they kind of fine tune it on a smaller set of data. Do you feel like that is working in biology? Is there an analogy to that where there's some like big data sets that you could train on, and then kind of modify the models to work on smaller datasets? It just seems so clear that's what happened in images. And I don't really know that the biology analogies to that. I mean, I think that's definitely when I go to talks right now, everybody's always using transfer learning. It's because of the fact that it's hard to make measurements, right? Maybe you have one system, and you've characterized it to death. And now you want to know the other system, the ability to train on the system that you've really characterized well, and then predict in a different system that's hugely valuable, and I think it's seeing applications all the time in biology. Maybe people have characterized one cell type really, really well. Now there's another cell type, but it costs so much money to characterize a cell type at that depth that now if they can use transfer learning to predict behavior for this novel cell type that hasn't been as well, characterized that's super valuable. Well, I guess there's a lot of commercial interest in biology too, but it's maybe it's cheaper to classify images. It was interesting that one very motivated professor Fei-Fei, at Stanford, could make this amazing data set that kind of changed the whole field. And I sort of imagined that the same type of thing in biology would be expensive enough to make it complicated and hard. And maybe no one's really motivated to do this as a general works project. I guess, another thing is the ability to crowdsource measurements, right? Yeah, totally. People are generating images all day long and uploading them and making them publicly available. 
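The transfer learning recipe described here, pre-train on a big dataset and then fine-tune on a smaller one, looks roughly like the following in PyTorch. This is a minimal sketch under assumed placeholders: the ImageNet-pretrained ResNet backbone, the four-class head, the learning rate, and the data loader are all illustrative, not anything from the episode.

```python
import torch
import torch.nn as nn
from torchvision import models

# Minimal sketch of the transfer-learning recipe described above: start from a
# model pretrained on a large dataset (here, ImageNet weights), then fine-tune
# it on a much smaller, domain-specific dataset. All names and numbers below
# are placeholders for illustration.

model = models.resnet18(pretrained=True)         # backbone trained on the big dataset
for param in model.parameters():
    param.requires_grad = False                   # freeze the pretrained features

num_classes = 4                                   # e.g. a handful of cell types or phenotypes
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head, trained from scratch

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def fine_tune(loader, epochs=5):
    """Fine-tune only the new head on the small labeled dataset."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:             # loader yields (image batch, label batch)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
```

Freezing the backbone and training only the new head is the cheapest variant; unfreezing some or all layers with a small learning rate is the other common choice when the small dataset allows it.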
I think the closest you come to that would be sequencing, right? People are sequencing and people are willingly sharing all of their genomic data with 23andme and ancestry.com and all of these places. That has sort of seen crowdsourced growth, and still not on the scale of images, but a huge amount of data. But I think what's really lacking is we're getting more and more sequences and that's great, but in the same way that for the images, you not only needed the images, but you needed initially to know, is this a dog or a cat or an arm or a barbell, or what is this? That's what I think we don't have as much in biology right now. We have all the sequence, but we don't have the functional annotation that goes with it that allows us to make that same sort of progress. And I think to me that's the bottleneck, right? That's the thing that we're trying to solve. I actually didn't really realize what your work was on this. It's so cool that you're, I mean, it seems like actually collecting data at a far bigger scale would be the perfect thing to make, the mathematical models work better. So it seems pretty cool. I think I'm obsessed with them. Marcus Covert, told me about this book, the Weather Makers, that he said was really great. And so I read it and we both took different things away from it. But part of the book is sort of talking about 100 years ago, people had these kind of primitive atmospheric models where you could have a room full of people, all doing calculations in parallel. They would start calculating, and at the end of 24 hours they had the ability to predict what the weather was going to be in 24 hours in the future. It was like, all of these people calculating could basically just keep pace with time, and it didn't really give you any predictive power. Now, we have these weather models that you can look 10 days out, and have a pretty good sense of if it's going to rain, if it's going to snow, what's going to happen with the weather. What Marcus took away from it is that, you really need to look at an entire system. Like a cell in its entirety, in order to really be able to model, and understand the behavior. What I took away from it was, this progress was really only enabled by the fact that we had weather stations around the world that were recording huge amounts of data, not in relative terms like, ""Oh, it's 10% hotter today than it was yesterday, or it's going to rain twice as much today as yesterday."" But there we're recording all of these data in terms of physical constants, like temperature and precipitation humidity. And that allowed us to develop these atmospheric models and to test the predictions of physical models, and to develop this predictive power. Our big push, using these technologies, using these microfluidic technologies, they make it possible to shrink biology, and make measurements at a much more rapid pace. We're really interested in trying to say, ""Can we do this for biological systems, but can we also always do it in the language of physical constance, right?"" There's quantities like energies that reflect how much energy it takes to fold something, and what the energy is when two different molecules come together. 
And so those are the kinds of quantities we're trying to measure, and I think that ultimately those types of measurements in concert with huge amounts of sequence data, and ML algorithms that are seeking to predict the function of different sequences, and how changes to the sequences alter function, those kinds of physical constants can be integrated with all of that other stuff to eventually attack these problems seem intractable now, but so did weather prediction 100 years ago. I guess, I feel like scientists always hate to answer this question, but I'm sure everybody's thinking that when he used that analogy like, when you roll this kind of work forward, 10 or 20 years or more, how would it affect like my day-to-day life? Is it like a lot of diseases get cured or I mean, what is the ultimate impact of this stuff? A lot of our science is pretty basic to be unashamed about it. Defense of that is, CRISPR has been this amazing tool, and it came from people studying the mechanisms of bacterial immunity, right? Nobody was looking for things that were necessarily going to transform our ability to engineer genomes, but that's what we found in the course of doing basic scientific research.",6298 +Adrien Gaidon — Advancing ML Research in Autonomous Vehicles,https://www.youtube.com/watch?v=MUJpblzB4Jo,2882,2021-04-22,"ML is everywhere, but it's like starting from perception, and more and more moving to prediction. And I think where to cutting edge is really is in the planning and control side. So how do you bridge these gaps from pixels to the steering? So ML is everywhere, of course, I'm biased. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lucas Biewald. Adrien is exactly the kind of person we imagined having on this podcast when we started it. He's the head of research at TRI, Toyota's Research arm. And he's been a long time user, maybe one of the very first users of Weights & Biases. And every time I talk to him, he has interesting ideas on the field of machine learning, and the tools necessary to make it really work in production. This is going to be a really interesting conversation. All right, I have a whole bunch of questions, but I thought I'd start with a little bit of an oddball one only for you. I always use this metaphor, building the Weights & Biases tools that I hope our users love our tools in the same way that a guitar player loves their guitars. And I know you are a guitar player, do you have a favorite one that you own and play? Right there. Nice. I don't know, I need to deactivate my Zoom background, but this is a road worn fender strat. How beautiful. I love the road worn because first, I don't mind damaging it and more. And second it has a really nice feel. I think like tools that's more general than just guitars, but they grow on you. It's almost like a lot of musicians give names to their guitars like, Eric Clapton famously, et cetera. And I think like really good tools, they become part of you and, and you develop a relationship with them. It's the case for cars, it's the case for guitars, virtual tools it's kind of interesting. Some tools definitely become part of you, I haven't named a WNB report after my daughter or something like this yet, but who knows? Well, besides WNB, what are your favorite tools that you use in your day-to-day job building machine learning models? If you're talking as a manager, that's not the same as if you're talking as a scientist. Answer both, please. I would love to have both. 
All right, I won't mention maybe the ones that everybody knows and loves. I like Todoist a lot as a manager, that's a great way to manage your tasks and stuff like this. I've been a longtime user of Todoist, and I really, really like it. I tried a lot of different ways to manage to-dos, et cetera, and keep track of this. They've got karma points and whatever, things like gamification. This one, I think, is a pretty nice recommendation I can give to everybody that has a lot of tasks and wants to stay on top of them. As a manager, just one is good. And what do you like about it? As you said, Todoist is the app. We should put a link to that. It's Todoist. Todoist. Yeah, in one word. I have to say I use WorkFlowy, and I'm like super attached to it myself. I'm curious, what do you like about Todoist? What's the... It's very simple. I think tools in this complicated world where you have many things to do have to be dead simple, and good synchronization across devices is super important, because when you switch from one to the other, et cetera. Nice. Well, this show isn't for managers, it's for the scientists. Tell us about, as a scientist, what tools you love. For the scientists, I mean, Jupyter Notebooks, obviously, right? I said I won't mention the ones that everybody uses, but this one I will still mention. Otherwise, I mean, PyTorch is just awesome. As a manager and now senior manager, I get less and less time to do technical stuff, as it should be, right? I focus on empowering my teams, et cetera, but I still have this itch, and sometimes I do a lot of code reviews, and I'm like, ""Oh, yeah, I want to try this thing."" PyTorch, even as a senior manager that doesn't spend like 50% or even 30% of the day coding, I still get back to it very, very quickly, because it's just so simple, very few abstractions, very little vocabulary. It's not a DSL, right? It's NumPy, it's NumPy on steroids, and that's just so easy to use. Interesting, can you say anything else about PyTorch? I'm always kind of curious, because PyTorch just seems to have these passionate fans as a user base. The people that use other frameworks, they use them a lot and they seem to like them, but somehow the people that use PyTorch seem like just incredible advocates. Do you have any sense of why that is? I've been working in computer vision since 2007. And so basically in 2012, I finished my PhD and then I moved to research in industry at Xerox Research. And what was interesting was that was just the time, I was big into kernel methods. Everything had to be convex, and learning theory, Vapnik, super clean. And then 2012, Krizhevsky, non-convexity, not a problem. All these kinds of things. And Caffe was very big, became very, very big, especially 2013 with Ross Girshick and Berkeley doing amazing work there, Yangqing Jia, et cetera, et cetera. All these kinds of tools were really born there, but Caffe is a C++ library, fairly easy to reproduce things, but it's fairly hard to do your own fork and do something very different, especially in the learning algorithm, like not changing architectures, et cetera, and that's the easy part of deep learning. But changing the task you're working on, or changing the overall learning algorithm, is more complicated. And I maintained an internal fork of Caffe and we did some papers, et cetera. But then there was the alternative of Theano, which was, let's say, an early days pioneer, and I will leave it at that.
A great library, but not necessarily the most user-friendly one. And then TensorFlow came, and it was a huge hype train, right? Of like Google, everybody wanting to work for Google, everybody was like TensorFlow, TensorFlow. Of course I jumped on the bandwagon too, and then Lua Torch was the only kind of Asterix a little bit, like the little village of resistance to the Roman Empire. But I never really liked Lua, I was always a big Python fan. And so when PyTorch came out, the nice clean design of Torch with Python, that basically became a no-brainer. And everybody that did Python, like the PyData kind of sphere, like the SciPy sphere, was familiar with NumPy, and immediately became familiar with PyTorch. And that was the genius, right? No training, no onboarding, you know NumPy, you can use PyTorch. I'm psyched for JAX, it's kind of interesting because now Google kind of realized this, that the DSL, graph-based approach is a very complicated ecosystem, a very complete ecosystem. So really nice for a production setup, but for researchers that are a bit more on the crazy side of things... I wish I had the time to play with JAX, basically. I've just looked at things and it sounds amazing. And I think maybe there's going to be more diversity of PyTorch-like tools. It's so interesting, I think when TensorFlow came out, I thought, ""Oh, people would just want to use the same tool, and everyone's just going to kind of switch to this."" It's been kind of surprising to see the passionate advocates of PyTorch, and at least by our internal metrics, it seems like it's getting more popular. Do you have any sense about what it is about the design that makes it feel so satisfying? Right. There is a really, really great paper at NeurIPS last year, if I remember correctly. I think it's already cited 1500 times, which is huge, right? For a paper to be cited more than a thousand times is a big, big deal. In less than one year, it just shows you how popular it actually is. It's by the PyTorch authors, like Soumith Chintala, et cetera, et cetera, all these great people. And they described their design principles in that paper, and I can recommend to your listeners to check it out. It's very accessible. It's a NeurIPS paper, so it might scare people away, but it's not like that, it's a really, really good paper to read. I won't summarize that paper, but the design principles are really, really good. And for me they are basically directly the reason for the great UX. It's user experience, right? You just can't force people in this age of open source and free tools that are widely available and also widely known, right? You'd have to live under a rock to not know PyTorch exists, right? Then the best user experience wins. It's just as simple as that. And PyTorch has just so few abstractions, I think it's like maybe four abstractions total that you have to know that are PyTorch specific, right? And again, the rest is just very, very generic, very powerful, has nice workflows. There's PyTorch Lightning that tries to simplify those workflows, maybe Keras-style, high-level APIs, but just the base level one is, you go from idea to experiments really quickly. So that would be my why. I love the idea of user experience of a library or a deep learning framework. You normally think of user experience as like a website, but the developer user experience is so important. I totally agree. And it's because basically coding, right, is becoming democratized.
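A two-line comparison makes the know-NumPy-know-PyTorch point above concrete; this is a generic illustration, not code from the episode.

```python
import numpy as np
import torch

# The NumPy-to-PyTorch point above: the array code is nearly identical,
# and PyTorch adds autograd on top. Purely illustrative.
x_np = np.random.randn(3, 4)
y_np = (x_np ** 2).sum(axis=1)            # plain NumPy

x = torch.randn(3, 4, requires_grad=True)
y = (x ** 2).sum(dim=1)                   # the same idea in PyTorch
y.sum().backward()                        # one extra call gives gradients
print(y_np, x.grad)                       # x.grad equals 2 * x
```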
There's a huge thing about no code, which is all about that. But code is still, like, people are going to code for a long time. People say, ""Oh, no code, people will stop coding soon."" No, same thing as self-driving cars, they're going to happen, but it doesn't mean that people are going to stop driving soon. There are kind of a lot of good things that can happen if you simplify the user experience for what used to be called power users. But the '70s era is done, where only the most hardcore geeks code. Everybody codes now. I mean, a lot of people code. And I guess you're more than a researcher, right? I mean, you've been working on autonomous vehicles at Toyota, trying to deploy them for quite a long time, right? I think some people might worry that PyTorch isn't easy to put into production, but you have one of the biggest challenges of productionizing your systems. How have you thought about that? Does PyTorch work for you in production? TRI, Toyota Research Institute, where I work, was created in 2016, roughly. And so we haven't worked that long on it, compared to, let's say, Waymo et cetera, that really started in 2009. But what's fun is that we kind of started with PyTorch almost from the start. At first, the first year, we were really working with TensorFlow, mostly for the reasons that you're describing, like putting things in production, et cetera. But we found out that iterating was actually a bit painful, and because the decision was within our power as researchers, we kind of switched to PyTorch fairly quickly. So that was one of the decisions I made early on that we're really happy with. The downside to it is, when you are on the bleeding edge, you have blood all over your fingers, you know, you cut yourself, right? And especially on the production side. In the early days, what it meant is that when you deployed something, it was like in Python, or something not glorious. I don't want to go into the details, because it's a bit like... But then the ecosystem progressed, right? And now, especially in the last year or so, PyTorch has really been growing and it's focused on productionizing. It turned out to be a really good bet. We did it from a research perspective, and for velocity of iteration, because, I mean, our stance is autonomous driving still has a lot of research problems to be solved, right? A lot of research. So you want to optimize for the bottleneck, right? It's something you know very well once you get to that production system, this theory of constraints. You look at your workflow, where's the bottleneck, and optimizing the rest doesn't really matter, because it's still the bottleneck that governs the speed at which you iterate. We found that experimenting was the bottleneck, and now production is not a bottleneck anymore because there are great tools. Like ONNX, we're using ONNX, we're using TensorRT as part of our tool chain to deploy models that run efficiently on GPUs, et cetera. There are even more recent projects like TRTorch, which enables you to go directly from PyTorch to TensorRT. And there are many more. Far beyond Nvidia hardware, there are exciting cross-compilation tools, things like the TVM stack, et cetera, et cetera. Production-wise, I think it's such a big deal to deploy models that if the second top framework, or the top two frameworks, don't have good solutions for that, they're doomed to fail. So they understood this a long time ago. And it's good now.
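The first step of the PyTorch-to-TensorRT path mentioned above is typically an ONNX export, roughly like the sketch below. The model, input shape, and file name are placeholders, and the TensorRT or TVM compilation happens in separate tooling outside this snippet; none of this is TRI's actual stack.

```python
import torch
import torchvision

# First step of the deployment path discussed above: export a trained PyTorch
# model to ONNX, which tools like TensorRT or TVM can then consume.
# The model and input shape here are placeholders.

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input that defines the graph's shapes

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                            # serialized graph for downstream runtimes
    input_names=["image"],
    output_names=["logits"],
    opset_version=11,
)
```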
And just so you're saying today that you actually can do it. You can get it into production, it's not a problem. Oh, yeah. I mean, we couldn't do it before, it's just not necessarily very nice production engineering, but now there are tools to do this in a really state-of-the-art way. Not just by researcher standards, but by proper engineering standards. Right, right. I feel a little reticent to ask you this question is probably everyone asks you this isn't my parents asking me. But in your view, since you're at the front lines, what is the state of self-driving car like? I think everyone talks about it, yet I can't get into a car, and tell it to drive me somewhere and have it do it. On the other hand, I live in San Francisco and I see these cars driving around autonomously all the time. What's going on? That's a good question, right? That's a standard question, that's a question everybody should ask themselves every six months or so. And the question is for how long, that I don't know, I can't predict the future, but I think that's one thing that attracted me when I came to TRI was, I was just surprised how much people thought it was solved, right? Back in 2016, when I really started working on autonomous driving, as a researcher work in computer vision and machine learning, I was like, I'm excited about a lot of exciting problems, how do we leverage the fact that labeling is expensive? So we want to optimize label efficiency, maybe even go self supervise, and these kinds of things. And it was just starting at this period, or using simulation, right? One of the big things I've done is leveraging simulation. And I was like, ""Wow, there's so many open research challenges, it's so cool."" As a researcher, I have a huge playground and a huge societal motivation to actually solve, there's a 1.35 million traffic fatalities every year on the road. I was like, ""This is a huge societal problem, it's super important to solve that, because this 1.35 million is just crazy."" But the reason that it's so high is because it's so hard. And so it's super hard problem, super important, and there's so many research problems. As a researcher, super excited. Move to Bay Area, everybody's like it's in six months. In six months I got this, everybody from this little startup, to the big companies, to OEMs, everybody was coming up with dates. 2018, we got this, but in 2016, 18, 19, 20, you name it. Go back in 2016 and listened to any announcements, or et cetera. You will see everybody promised everything every time. And it's to get VC money and everything like this, I know it's Bay Area we get funding. But the stance, Gill Pratt, our CEO, which is a former DARPA Director, he was an MIT professor and everything. He is very, very smart and an excellent roboticist. And he had always a deep appreciation for the problems. And he was at the labs and all kinds of things. And so it's always been like it's much harder than people think is going to take much longer than people think. And therefore, if you're serious about it, you should be committing long-term resources, and treat it as a research problem. Research is our middle name, like John Leonard, a famous robotics professors, is one of our VPs always says that. It's going to take a while, it's going to take a while and people are now coming to this realization, because in spite of all the hype and everything, when the results are not there at the given time, while you have to face the facts, right? 
And so now what we're seeing is we're seeing a conservation in the field. People that are really committed to this problem, longterm, they're willing to sink in the money to time, et cetera, and maybe open their minds a little bit to, ""Hey, it's research."" We for instance, need like strong partnerships with Academia, which we work a lot with Stanford, MIT, and University of Michigan for those reasons. We don't know all the answers, we got to work with people too, that also don't know the answers, but can take the scientific approach to try to them out versus just say, ""It's solved, we just need to throw 100 code monkeys, or 1,000 code monkeys, or 10,000 code monkeys at it, and it's going to work."" I think that's not the case. And even the engineers at this company is actually doing a fair amount of research. Even in the engineering heavy companies, I think so. I was telling a Slack community that I was going to interview you and asking them if they had any questions they wanted to ask. And I thought one of the really good ones was, it's a little bit general, but you're kind of alluding to it is what are the big academic advances coming, that'll kind of change the game for self-driving cars? And you seem like the perfect person to have a perspective on this. One thing that I'm particularly excited about, and that I've been doing some work on is differentiable rendering. There's this huge ambitious vision, I think the academic professor that I think embodies this the best is probably Justin and Belmont, MIT is a really, really amazing professor, if you don't know about him, just check out his research. And him and his students, and [inaudible 00:15:50] Wu is now a professor at Stanford. We're actually discussing with them and they have super cool ideas around this vision as inverse graphics program. And I think that's really the right way to frame the problem, Alan [Reel 00:16:02], another really interesting professor was basically calling this analysis by synthesis. So the idea is that what you want to do is with deep learning right now, which is fully supervised, is just you're learning a function that says, ""Here's an image, you say jump."" I say, ""How high?"" Is like, here's an image, cat, dog. Just say cat or dog. Cat wrong, that's a dog. And you do that thousands and thousands of times, right? It's not unlike how we teach, how I was teaching my daughter colors, as like red, yellow, no, it's red, blue, no it's red. And you do this kind. Then it kind of exponential takes off and they become much smarter in their learning. But this initial phase of learning, which has roads memorization kind of like, this is how deep learning works. The problem with that is that interpretability, data costs, lots of problems around that. And so for vision, what's interesting is the world has structure, right? And there's physics like Newton existed. There's a physics stuff, there's gravity, there's physics stuff alike. There's a lot of inductive biases that you can leverage, you can take basically just physics and physical laws and then try to bake it into your learning approach. And differentiable rendering or inverse graphics is one way to do it. Basically it's just take your sensor, you're trying to deconstruct the world, and resynthesize it. And that way you can compare in a self supervise way what you reconstructed from what you observed. And the benefit of that is that you get systems that generalize much better, that can be trained on arbitrary amounts of raw data, don't need labels. 
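One way to caricature the deconstruct-and-resynthesize idea described above is a reconstruction objective: encode the observation into some latent explanation, re-render an observation from it, and train on the difference with no labels. The sketch below is a toy autoencoder-style stand-in under that assumption; a real differentiable rendering pipeline would decode structured scene parameters such as geometry and pose through an actual renderer, which this snippet does not attempt.

```python
import torch
import torch.nn as nn

# Toy sketch of the analysis-by-synthesis idea described above: encode an image
# into a latent "explanation", re-synthesize an image from it, and train with the
# self-supervised reconstruction error, no labels. The generic latent vector and
# linear decoder stand in for structured scene parameters and a real renderer.

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())
decoder = nn.Sequential(nn.Linear(128, 3 * 64 * 64))   # stand-in for a renderer

optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)

def self_supervised_step(images):
    """One training step: compare the re-synthesized image to the observation."""
    latent = encoder(images)                     # "deconstruct" the observation
    reconstruction = decoder(latent)             # "re-synthesize" it
    loss = nn.functional.mse_loss(reconstruction, images.flatten(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random placeholder images of shape (batch, 3, 64, 64)
print(self_supervised_step(torch.rand(4, 3, 64, 64)))
```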
And they also have some interpretability to them, they have some structure, right? Because they're deconstructing the world and following some structure, et cetera. Differentiable rendering is a big, big one for me, vision as inverse graphics is a big one, and there are many others. Self-supervised learning in general is something I'm very excited about, and it goes beyond just differentiable rendering. There are many other ways to leverage self-supervision, especially time, when you look at video, like the temporal dynamics. And contrastive learning is a super hot topic right now. And there's interesting work, I think from Max Welling's lab, called contrastive structured world models, that I think is a cool paper. Not really super applicable right now, but I think pure and exciting ideas, and I would just leave it at that. Vision as inverse graphics, self-supervised learning, I'm super stoked about that. I hadn't heard of contrastive learning before. Can you describe that briefly? You did such a good job with that with the other topic. All right. Well, I mean, overall, in a simple way, I would say that for contrastive learning, there's a really cool paper that I can recommend everybody to read, which is the paper from the godfather of deep learning, Geoff Hinton. It's called SimCLR, SimCLR. And it explains it a little bit. It got state-of-the-art results. Basically there are two big approaches in contrastive learning that work really well, SimCLR, and MoCo from FAIR, Kaiming He, another super impressive researcher. And the basic idea is, it's some form of metric learning, if you want. You basically want to learn a representation that verifies some ordering property, or some distance property. A traditional way would be, here's an example, here's one that is close to it, and here's one that is far from it. And what you want is, you want to learn the properties of your representations such that this is true, in a very simple way. And in general, it's related to metric learning in a general way, but the cool thing is that, for instance, in this C-SWMs paper, the contrastive structured world models paper that I was mentioning, you can look at temporal dynamics as one way: things that are close in time should be close in representation in feature space, and things that are far should be further away. It's not always true, and actually we have ongoing work with Stanford, a paper called CoCon, cooperative contrastive learning, where the idea is, in some cases in videos, things repeat themselves. And so you want to basically leverage multiview relationships, such that you know that the same thing in multiple views should also be close. It's not just contrastive learning, but also cooperative. But it's an exploding field, there's so much work on that. The cool thing about SimCLR, et cetera, is it was shown that you can replace pre-training on a large labeled dataset like ImageNet by just doing unsupervised pre-training with a contrastive loss. Well, super cool. And in practice, it's a big deal, because for instance, we can't use ImageNet to deploy products. If you're wondering like, ""Oh, I can just easily take an ImageNet pre-trained model, get a few labels, few-shot transfer, and use it for production,"" you can't really do that, unless you have a license, a commercial license or something like this.
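The close-things-close, far-things-far idea behind SimCLR and MoCo is usually written as an InfoNCE-style loss over a batch of positive pairs. The sketch below is a minimal illustration of that loss, not the exact SimCLR NT-Xent implementation; the encoder that would produce the embeddings is omitted and the tensors are random placeholders.

```python
import torch
import torch.nn.functional as F

# Minimal InfoNCE-style contrastive loss in the spirit of the SimCLR/MoCo idea
# described above: two embeddings of the "same thing" (two augmentations of one
# image, or two nearby video frames) should be close, everything else in the
# batch should be far. Illustrative simplification, not the exact SimCLR loss.

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (batch, dim) embeddings of positive pairs, matched by row."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature      # similarity of every z1 to every z2
    targets = torch.arange(z1.size(0))      # the positive for row i is column i
    return F.cross_entropy(logits, targets)

# Example: placeholder embeddings of two augmented views from some encoder
z_a, z_b = torch.randn(8, 32), torch.randn(8, 32)
print(info_nce(z_a, z_b))
```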
Being able to do unsupervised pre-training, which was one of the early days, early inspirations of deep learning, with restricted Boltzmann machines and whatever, you wants to do unsupervised pre-training with a lot of data for a lot of time. And then very quickly fine tune with a few shots setting, like a few labels. And it seems like we're there now. Very cool. All right, switching gears a little bit, I just want to make sure I ask you this question, because you were telling me that you listened to our interview with Anantha, who's a VP of engineering at Lyft, and I think he brings maybe a different company's perspective, and maybe also a different... He kind of came up through engineering and thinks of himself as an engineer. And I was kind of wondering how your answers for the same questions about taking autonomous vehicle to market would differ from what he said. One thing that I take from what he said was, he talked a lot about the organizational aspect. I think that was really interesting because when you think about engineering and you think about the problem like self-driving cars, it's not a one man or 10 men team, right? Or one. It's that 10 people effort. The challenge is it requires a lot of people and a coordination of a lot of people, also it's a robotics problem that is pretty wide in the skill set that it requires. You have from people like hardware, we have amazing hardware people at TRI, which is kind of always impresses me, because I can't use solder iron, even if you put the gun to my head, but these guys, they are magicians. We have really good hardware people, you have cloud people, you have all kinds of different skills. And one thing that I remember was in the podcast was, ML is a skill, right? ML is a skill that is to be shared with everybody, and so that's why it's kind of diffused in the company to be successful at deploying this. I think that's a really good point, I agree. I would add something to it, which is because I lead a machine learning team, right? There is such a thing, so even though it's a skill and it should be everybody has it. I actually lead a team called machine learning, right? Machine learning research, and so if it's a skill and it's diffuse why you have a team that's like this. And we iterated through a couple of models of we're kind of experts, and then teams can basically request projects where we help. So we kind of like embedded in other teams, but that was not necessarily super successful, we basically got back to we do our own projects, and we try to then seed some kind of more crazy ML projects that other team then can carry forward. In terms of bringing it to markets, for me, this is the organizational challenge. I know it's kind of maybe not typical answer, but I think because he insisted on that, I think this is really good to... There's something called the Conway's law, which is an organization that produces software tends to produce software that's structured like the organization. If you have typically in self-driving cars, you have a perception team, you have a prediction team, you have a planning team, or you have a perception module, you have a prediction module, and then you planning model, right? And then you have the whole kinds of challenges as a manager, which I discovered, which is like siloing, communication across teams, all these kinds of things. And as an ML person, that's leading an ML team. What I found difficult is that, in ML you want the holy grail for self-driving cars is that they improve experience. 
And I think that's one of the biggest misconceptions that people have about learning. If you chat with like people like your grandma or whatever, about learning and you explain them the high level concept, what they immediately think is that the robot learns after deployment, right? You kind of your self driving car might be done when you buy it, but it's going to become smarter because you're going to teach it. And that's what machine learning is. And that's not at all what it is, right? That's not at all how it works, right? There's a duty cycle, there's an operator... You retrieve data, you look at data, you label it, you trust it, and then you deploy it. And this can take a long time, right? On some huge timescale this might be true, but on the short time scale, it's absolutely not true. The iteration speed is the key, and the challenges with this organization around perception, prediction, planning makes it very difficult to have the whole system optimized really quickly from use. And so I think that's the major bottleneck for me as a machine learning person, which is, if driving from demonstrations, like user experience and things like this, how can we make every system as quickly improving as possible? And this is this idea that we're very big on TRI called fleet learning, right? Which we don't care just for cars, but for home robots in general is like, we have millions of evolutions and millions of years of evolution plus decades of parental education, and machines like a car doesn't have that leisure, right? Nobody would buy a Toyota if they had to say, ""All right, I buy six months old, and then I have to tolerate all kinds of distractions like we were talking about just before recording."" And no way people would buy a car like that, or a robot like that, right? That destroys half the home and then say, ""Oh, it's okay, it's learning."" We got to speed things up, right? So the learning has to be much more accelerated for machines than it is for humans, and the only way to do that is parallelism. And so fleet learning is something we're very, very big on for that purpose. Fleet learning and end-to-end system level optimization and the right organization to match behind, I would say are the three big bottlenecks to deploy any robotic system. Interesting, I guess I kind of think of machine learning is primarily helping with perception. Am I wrong on that? Do you view machine learning as something that goes everywhere in the- Yeah, both. Yes, you're right that today perception is the main application for machine learning at least in robotics. The reason for it is because there's just no way around it, ImageNet competition is kind of funny, one of my mentors and one of the people I admire the most is the called Fleur [inaudible 00:26:20]. And he was the head of the computer vision done at Xerox Research. And he won the ImageNet challenge before deep learning. And in the year of deep learning, people say, ""Oh, in deep learning have the error rate."" Well, they have the error rate of flow, which he improved every year was improving 2% extra. There's some kind of inevitability to it, and again, flow became really good at this, in the lab, and we all got into deep because again, we face the evidence as scientists. It's inevitable because it works so much better, and also because there's no other way. You cannot engineer a world model, because you do this. 
And then you say like, ""Oh, these are the labels I need, these are the features I need, and all these kinds of things."" And then the world constantly changes, the world is non-stationary. Then you have like scooters, you have literally humans flying at 30 miles per hour on the streets. And you're like, ""Wait, what? Is that a pedestrian? Is that a motorcycle? Is that a bird? Is that Superman? What the hell?"" And so it's inevitable, and it works so much better. For perception, it's a no-brainer. Even the most hardcore feature engineering passionate people, or people that believe there's an equation for everything, nobody I know argues that this is the wrong approach to perception. But it's not the solution either, it's not a slam dunk either, because we need to go beyond that. I would say robust perception is not solved. Some form of perception, when you know everything, et cetera, and you don't care for these nine nines of reliability, right, you can get really, really far. But uncertainty modeling, handling false positives and all these kinds of things, that's a really hard problem. That's why, in machine learning, every abstraction is leaky. Going back to PyTorch, that's why I like minimizing abstractions, because any abstraction is leaky. And the problem with the modular robotic stacks, like perception, prediction, planning, is that you're making abstractions, you're making APIs. And the contracts you're making, if you think microservices type of things, are all statistical in nature. You're kind of saying, I'm going to give you something that I'm calling a red traffic light, and I'm confident that 99% of the time I'm right. What happens during this 1%? You're on your own, right? And it's unavoidable, right? Because no system will ever be perfect, and you shouldn't require a robot to be perfect. It needs to be better than a human, but it doesn't need to be perfect, otherwise you will never ship. How do you robustly handle uncertainty, and how does it propagate through each layer, and how do you think statistically versus logically or symbolically? And that's becoming harder and harder as you move from perception, to prediction, to planning, because planning is actually reasoning, right? It's search, it's reasoning, it's a higher-order cognitive function in a sense. And manipulating just feature vectors, like esoteric feature vectors, is not really how it works. These neuro-symbolic systems, the best of both worlds. Marco Pavone is an awesome Stanford professor that's doing cool research on that: how do you combine deep learning with more logical forms of reasoning? Something we're also looking at at TRI a little bit. How does it work today? Do you actually send more information? I feel like other people that I've talked to have talked about not just sending the output of the perception algorithm, but maybe even some of the activations of the parts of the neural network before the output. But then I wonder, what do you do with that downstream in a sort of logical system? How does TRI handle that? Right. Actually, that's not the approach we're taking, because you're right, it's kind of like you're just pushing the thing under the rug. It's like a hot potato game, it's like, I don't have to solve this problem, there you go. And typically it doesn't really work well across teams. It's like, ""I don't know if this is going to work, but that's your job now."" My personal holy grail is building an end-to-end differentiable, but modular system.
It's still like engineering what you know, but learning what you don't. And so what it means is that you still have a perception module. It still outputs some concepts like, ""Oh, this is a person."" Persons exist, roads exist, we know this, right? The problem is that we're unsure whether our inference about them is right. Here, my boss Wolfram Burgard, who is one of the legends of robotics because he wrote this book called Probabilistic Robotics with Sebastian Thrun and Dieter Fox, and created this whole movement. One thing we discussed very often with Wolfram is like, there shouldn't be an argmax, right? If you have an argmax in the middle, somewhere upstream, you are basically destroying uncertainty, right? You're just forgetting any uncertainty you have. And what's really interesting is that this is coming from a theoretical perspective, but again, from an organizational perspective, if you are the planning team and I give you something from my perception module, and I tell you, this is a red traffic light, and I'm wrong, you're going to be saying, ""Hey, we crashed, it's your fault, you're wrong, fix it."" And I'm like, ""Well, but I cannot always be right,"" and you will be la, la, la, right? This is not how it works. The things you pass, every data structure that you pass, every piece of information that you pass, is a distribution. It's probabilistic in nature. I know I'll sound like a crazy Bayesian guy, but I'm not a Bayesian guy. Just from a principled approach, you aren't certain about everything, right? That's a good principle in life too, you shouldn't be too confident in everything. But so you pass the uncertainties. Very concretely, with your object detector, you try as much as possible to not argmax, let's say, over the logits to say like, ""Oh, this is a person, I'm sure."" You're passing the full probability scores. And then you have to handle it downstream. You have to have a model that doesn't say, ""If person, do this,"" right? That breaks any kind of rule-based system you would have downstream. You have to digest uncertainty. We have a recent paper at IROS where we showed that, for instance, you can pass probabilistic perception outputs into imitation learning, like behavior cloning. The system was done with ETH, Andreas Buehler, an intern of mine. It's going to be published soon. Passing uncertainty and leveraging uncertainty in the representation for downstream applications. We also have very cool research with Stanford, with Boris Ivanovic and [inaudible 00:32:17], two wonderful PhD students at Stanford working with Marco Pavone and Mac Schwager, and it's interesting, it's like people in robotics and aeronautics, et cetera. And they're really, really good at thinking about safety and these constraints. And so here the idea was, Boris made a paper called Trajectron, Trajectron++, which takes in tracks of objects and can output multiple possible trajectories. And that's great, you can predict the future with that. But the problem with that is that it's very difficult to leverage in a planner. Now we can say, ""Oh, I could go left, I could go right, I'm not sure."" And then the planner is like, how do I decide, right? And if you're too conservative, right? If you mind safety and you're too conservative, then what happens is that everything is possible, therefore you have the frozen car problem, right? It's like, I don't know what to do, everything is possible, therefore I will not move. So then you have a self-driving car, but it stays in the garage, right?
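One way to picture the no-argmax-upstream principle discussed above is that perception hands the planner a probability distribution over classes rather than a single hard label, and the planner scores decisions by expected cost. The sketch below is a toy illustration under that framing; the classes, costs, and numbers are invented and are not how TRI's planner actually works.

```python
import torch

# Toy illustration of the "don't argmax upstream" principle discussed above:
# perception passes a probability distribution over classes to the planner,
# and the planner reasons with expected cost instead of a single hard label.
# Classes, costs, and numbers are invented for illustration.

logits = torch.tensor([2.0, 1.5, -1.0])              # raw detector scores
classes = ["red_light", "green_light", "no_light"]
probs = torch.softmax(logits, dim=0)                  # pass this downstream, not argmax

# Cost of each action, per possible true state of the world
cost_if_go = torch.tensor([100.0, 0.0, 5.0])          # driving through a red light is catastrophic
cost_if_stop = torch.tensor([0.0, 2.0, 2.0])          # stopping is only mildly costly otherwise

expected_go = (probs * cost_if_go).sum()
expected_stop = (probs * cost_if_stop).sum()
decision = "stop" if expected_stop < expected_go else "go"

# With an argmax upstream ("it is a red light, period"), the planner would never
# see how uncertain the detector actually was.
print(probs, decision)
```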
Not great. With high key voice, we basically did a system where we modified some of the controls. So it's like, you have to have very deep knowledge about control, and people Mark Schwagger, Marco Pavon are really super, super smart about this. And this is called risk sensitive control, where it, basically, what you can do is you can leverage this different samples from the trajectories, and reason in terms of control of how do I minimize my risk? How do I optimize my objective, like I want to drive, I want to go there, right? But at the same time, I want to avoid collisions. And so a really interesting thing is that there's a simple mathematic trick called the entropic risk. And I can refer to the same thing published at IROs. And you can find this on my website, where you can basically change the objective function. So it's almost just a change of mathematical formulation of the optimization problem, of how to plan and you can have a very interpretable high level variable that's called the risk sensitivity to say, ""If you're risk sensitive; you can go there, If you're a risk neutral; you can go there, if you're a risk seeking; you can go there.""",6786 +Nimrod Shabtay — Deployment and Monitoring at Nanit,https://www.youtube.com/watch?v=agWzytw7tcs,2039,2021-04-15,"The focus, as I see it in the industry has shifted from sometimes, making the models into making them work well in the real world, and be able to be flexible enough and adapt changes. I guess, I can say that, many times, maintaining the model and make it good and reliable out there is sometimes much harder than actually developing it. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Nimrod is a senior computer vision algorithms developer at Nanit and the father of two children. Nanit develops smart baby monitoring systems, and it's a product that I happen to use every day. So, I'm extra excited to talk to him. Nimrod, I'm super excited to talk to you about the article you wrote on ML in Production, but I'd say I'm especially excited to talk to you because you make maybe the app that I use the most these days, the Nanit app. My daughter actually turned one today, and we've been using it for the last year. Basically, every morning, my mother-in-law and my wife discuss the stats from the previous night's sleep. I really, really love your app, I could say that honestly, and I was proud to discover that you are customers of Weights & Biases. But I was wondering if you could start by maybe talking about what your app does and what the history of the company is, and how you think about that. Yeah, sure. So, first, I'm happy to be here. The whole company started by an idea of a staff, one of the founders, that actually wanted to monitor his son's sleep during the night. Since he came from the whole world of processes and monitoring using cameras, and he wanted to take that to his son, and it started as a project when he was at Cornell university and everything just rolled from there, actually. And since we have a camera and he is from the field of computer vision, we started the camera, and we started doing the smart baby monitor using computer vision algorithms that can attract sleep, also, the breathing motion, and then, let you celebrate the milestones of your baby. For example, sleeping, falling asleep first time on his own, and sleeping through the night without any visits from the parents, which is great for us, the parents, of course. 
And they give you specific sleep tips in order to improve your baby's sleep. Actually, the key, or I can say what guides the company, is what value we can extract from the visual data that the camera collects. So, it's kind of obvious on sleep, and of course, on breathing for young babies. But this is also the guideline that guides us for the next products and features, how to give value in terms of health and wellness to our customers. And it's also really unique since this product wears two hats, basically. It has the hat of a consumer electronics product as you use it, and it's also a research tool, which has started being used more and more recently. Researchers are doing in-home sleep research with it. So, it's pretty cool that science and technology are working together and we get to deliver a really interesting product. That is really cool. And I think folks who are listening to this who haven't had children yet might not realize how essential sleep is for your sanity as a parent, and also, how important sleep is for the sanity of your child. Oh, for everyone, yeah. I think we thought much more about sleep in the last year than I ever thought about it before. One of the key advantages of the product is, as parents, you get up at night for your children, and you're drowsy, and you don't remember exactly, did I get up two times, was it at 3:00 AM, maybe it was 5:00, I don't remember. And Nanit just collects the data for you and serves it to you clearly in order to make a useful summary of the night, and you can also make data-driven decisions, if you want, and not go by beliefs, because this whole field of baby sleep is full of beliefs. Some say that this method works better than the other. And here, you get the facts, you get the data. The baby slept well, the baby slept better, the baby didn't sleep that well this night. And we also see that, since parents are focusing more on the baby's sleep, babies with Nanit sleep better, they sleep longer, their sleep quality is better. Because everyone is in this process and they're focusing. So, it's really amazing, I must say. That's really amazing. How do you know that babies that use Nanit sleep better? We have a large user base, and we often send surveys to our customers, and they actually respond to that. And we see in the statistics and in what they're telling us that babies with Nanit just sleep better because you're more aware of it. The tips are useful. So, you're in a mindset of improving and of how important sleep is, I guess that's- Oh, that's very cool. Can you break down what the... You know, this is supposed to be an ML podcast, parenting has been coming up an awful lot lately with the people we've been talking to, but can you break down the pieces that are kind of ML problems, or computer vision problems, that you need to solve to make the app work? Yeah. We use all sorts of computer vision algorithms in order to get a good understanding of the scene. I mean, in order to know, for example, when the baby is falling asleep on his own, and whether a parent comes to visit or not, all those are actually computer vision problems that we need to solve. And we actually serve multiple models during the night in order to get the whole scene understanding. On top of that, we take those outputs from the models and serve you the data much more clearly, so it's easy to follow what went on during the night. Do you run the models on the phone, or do you run them in the cloud? How does that work? Mostly, in the cloud. 
We do have some algorithms that run on the camera as well. But mostly, in the cloud. Can you give me some sense of what the scale of this is, like how much data your models are handling, or how many streams of video you get in a typical night? Yeah. Let's take a short example. We have more than 100,000 users and we cover full nights, which basically means that if we serve, for example, every 10 minutes or so, we're getting to a few tens of millions of calls per model per night. It's a nice scale. I mean, we get to serve tens of millions of requests per night across all our users. And these are pretty sensitive models. I've noticed that you've never gone down. I mean, at least in my experience, it seems like you do a really good job with reliability, and I would think you'd have maybe a higher reliability bar than some other applications of folks we've talked to. Yeah. Well, you're right. Since babies are actually the most important thing to the parents, we try to be as reliable as possible in terms of robustness of the models and accuracy of the models. And also, in terms of runtime, to reduce downtime as much as possible, because, again, everyone expects our algorithms to work all the time and give them the data, especially when it comes to babies. So, we're putting a lot of effort into that as well. And I guess the sleeping model's important, but the one that seems like it must be kind of anxiety-producing... I mean, just talking about it, it's giving me anxiety, but the breathing motion monitoring, is that also an ML model that checks for that? Well, we use multiple models there. There are some models that are more machine learning, deep learning based, and there are some classic computer vision models as well, all sorts of models. Why do you use multiple models for a single application? Well, we have many tasks that we need to solve in order to get this product to be reliable and robust enough, especially when we're talking about breathing motion. So, I guess when you look at handling like millions of requests per night, what are some things that you do to make sure that this is reliable and make sure that your compute spend stays the same, how do you think about model architecture, and how do you deploy your models, and what frameworks and tools do you use? That's pretty interesting. On our team, we're actually responsible for the whole flow, end to end. I mean, from developing and defining the task, all the research, selecting the model architecture, even conducting a proof of concept many times. We'll probably elaborate on that later because I think it's really important nowadays for practitioners in the industry. Also, the whole training process, of course, which is where you come into the picture with some great tools, helping us find which models and experiments are better. Evaluating, which is actually pretty interesting because we try to construct evaluation metrics that also hold the product objectives inside as well, because we're not building models in a vacuum, we're all tied to a product and the value we give to our customers. It's not always that straightforward. And all the way until deploying to production, including building monitoring systems, which should be our eyes out there eventually, and runtime optimization, as you said, not to spend so much on compute. It's a pretty complicated flow, but over the last few projects, we actually formed a nice formula for it, which I posted in a Medium blog post as guidelines, which has proven to be successful the past few times. 
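As a rough back-of-the-envelope check on the scale Nimrod describes, here is the arithmetic with assumed values for night length and model count; only the 100,000-user figure and the roughly 10-minute cadence come from the conversation.

```python
users = 100_000        # "more than 100,000 users"
calls_per_hour = 6     # serving roughly every 10 minutes
night_hours = 10       # assumed length of a monitored night
models_per_frame = 5   # assumed number of models run per frame

calls_per_night = users * calls_per_hour * night_hours * models_per_frame
print(f"{calls_per_night:,} model calls per night")  # 30,000,000 with these assumptions
```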
It's actually in the trend, at least, as I see it now. I mean, every time I read on Twitter or LinkedIn or whatever about people that are talking how to maintain and deploy and make good models in production, because there isn't any silver bullet there, and there are companies that always trying to solve the whole pipeline, some part of it, so it's pretty interesting. I mean, the focus, as I see it in industry, is shifted from sometimes making the models into make them work well in the real world, and be able to be flexible enough and adapt changes. So that's... I guess, I can say that many times, maintaining the model and make it good and reliable out there is sometimes much harder than actually developing it. Which is kind of amazing, if you think of it. I guess that wasn't exactly the focus few years ago, but kind of like get there. Tell me some stories about some stuff that you've ran into, and if you could tell me specifically, like maybe pick a model and what it does and what were the issues that you ran into in the process of getting it deployed and running. Yeah. We can take object detectors, as example, we use them, of course, in our product. And- And in this case, an object detector would be like a baby detector or like a parent detector, is that fair? For example, yeah, it can be... Let's say for example, yeah, a baby detector. So, when you take a baby detector, and you actually want to start building it, you must be aware of, for example, the evaluation on how you're going to be performed. I mean, that's a common pitfall. I mean, choosing the right evaluation metrics is pretty tricky, and I know that I can say for myself, I have to recover from some bad decisions, and it's actually how you look on the model and... If you could break that down, I mean, so what would be a bad evaluation metric from a baby detector? Because I can think like, probably some people are listening to this and thinking like, okay, accuracy sounds a pretty good metric, but what would be a metric that might lead you astray with the baby detection model? Okay. Let's take just a tiny example about it, and let's say we have a baby detector and its accuracy, let's say, it's pretty good, but, eventually, in the product, we care more about the false positives than the false negative, for example. Okay. And how you look on the evaluation metrics can really affect that. So, if I will give a little bit more a weight to the false positive, we saw, for example, a decrease in accuracy on some metrics that actually average everything at once, but eventually, this is the right metric and we get a much higher performance. Or also, the other way around. I mean, we have a model that has very high accuracy, but eventually, since the product was aimed to try to decrease false positives, the product or metric was way lower. It's really how you look at it. And that's the tricky part, I think. I guess what metric then could you move to, and then, what would you do to improve that metric? Once you define the metric, you can always try and see where are the weak cases, and maybe how you can strengthen them, even if it's more data, or even if it's... Especially kind of annotation, alpha augmentations, but again, those things can be under the radar if you don't give them enough weight. I mean, that's common failure case that actually happened in the past. Wait, can you explain one more time what happens there in this failure case? Yeah. 
Let's say, for example, we took an overall accuracy measure for a baby detector, but we detected the baby when it wasn't there, and we had high recall, which compensated for that. Eventually, we got to very high accuracy. But, for example, for product purposes, the precision needed to be higher in order to give enough value to the product. Actually, another way of looking at it is treating the precision as the biggest parameter for us. And so, once we changed to look at that, we could clearly see the problem and fix it. How do you fix a problem like that? So, collecting the data in a much more dedicated way for your problem. Maybe see whether you're actually collecting the right data, and not just randomly sampling the data at some point, but actually directing yourself to the places, to how the data will look when the model is in production. So, you want to try to imitate that and collect data from those parts in order to train your model on what it's actually going to see and not on what's easy to collect. It's probably one of the best solutions. So, collecting data for the cases where you think your model is struggling and adding that, as opposed to random sampling? For example, or maybe collecting the right data for your problem. I mean, you can collect data in many ways, and collecting the data that suits your problem is the first thing, actually, I think you need to do, and put a lot of thought into it. It's actually my first bullet in the guidelines, start by defining what's the right data for you. Don't just collect data and start working on a model, because you're going to waste time. Do you have ways of explaining to a business person how to justify the cost of data collection in terms of some metric that they care about? Is that an important thing to you? We try, at Nanit, to keep a close connection between the product and the algorithm performance. Because data collection is very expensive and our time and our resources are very expensive, we try not to make perfect models that will have no effect on the product. Yeah, I guess this process is pretty easy for us because this is one of the first priorities when we start a project. And are you also, in parallel, experimenting with different kinds of algorithms, or doing hyperparameter searches? Is that important to you at all, or is it really... Yeah. ... just the data collection? No, no, no, no. I mean, data collection is good, but we're actually doing all sorts of hyperparameter tuning and choosing models, and we have a really organized methodology about what to do first. Can you tell me your methodology? Well, I mean, not that in particular, but I guess a good thing to do is maybe start with trying to get the best model you can get, and trying to get an upper bound on performance, and ignore runtime, for example, just to see what your upper bound on the problem is. Because in many cases, the algorithms are working on public datasets, and detectors work on MS COCO and classification, for example, on ImageNet, but it's not in all cases a good proxy for your problem. Medical images have their own datasets, but in some other areas, the data is not always natural-image style. So, you've got to try models and a lot of hyperparameter tuning. It's most of the work in the training. I mean, it's not manual work, but it takes a lot of time. And then, once the model is deployed, do you stop there? I would imagine you'd have new problems that would come up. Do you see data drift as an issue for you? 
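A small sketch of the metric change he's describing: rather than one averaged accuracy number, weight precision more heavily than recall (here with an F-beta score and beta < 1) so that false positives from a hypothetical baby detector pull the score down. The counts are made up for illustration.

```python
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """F-beta with beta < 1 weights precision more heavily than recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta

# Hypothetical counts: many spurious "baby" detections but few misses...
print(precision_recall_fbeta(tp=900, fp=300, fn=20))
# ...versus fewer false positives at the cost of a few more misses,
# which scores higher once precision is weighted more heavily.
print(precision_recall_fbeta(tp=850, fp=50, fn=70))
```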
Like how do you think about production monitoring? We put a lot of effort in production-monitoring. I think it's really important. And people sometimes, underestimate that, because once you deploy a model, I guess, it's not ending, it's actually the beginning, because it's much harder. And you need to invest a really good planning and making your monitoring systems to be reliable enough and give you enough confidence, because once you deploy the model, that's the only thing you can see. And the performance on the test that you get before you deploy the model is just a single time. After that, you'll get many timeframes with performance decision, and you need your monitoring to be reliable enough to spot some shifts and maybe sudden drops, and try to understand what happened. I guess I can say that we never stop with the models. We always look on the monitoring and see where we can see any problems, and what it's connected to. I think one of the issues is you don't really have ground truth in production. So, how do you know if there's a problem? It's true, it's pretty complicated. So, we always consider prediction distributions and common stuff like that. We also use other routes as well. For example, user satisfactions and maybe tickets they open, so we can spot maybe problems there that we didn't caught up in our monitors. We try to find the source whenever we can. And usually, from other parts of the company as well. Interesting. I always wonder how people do... I've heard different variants, but... Well, you actually file a ticket against the ML team if you find a bad prediction. Like what do you do with a ticket like that? Well, they don't file it specifically to the ML team, but yeah, people file tickets for bad predictions because everything is actually based on that. You can get wrong statistics and bad results and you're a parent, you want to get the data for your child, you pay for this product, and you want answers. It's actually quite a challenge, I mean, since we have so many users and we need to keep our models in a very high performance level in order not to make so many tickets for us. And also, make the experience for our users much better. So, it's a challenge. One thing you talked about in your paper or your medium post was preparations before deploying a model to production. Can you talk about how that works? Yeah. We try to simulate as much as possible how everything will be in production. For example, we actually create a production-like environment, and we also get some of the users to use that. Of course, they are supportive, and they are aware that there's going to be changes. And we try to monitor everything we can there in order to see that our model form the way we expect, that we don't see any issues. And that, of course... In parallel, we also do all of those end to end tests of all of our algorithms together to see that the new model behaves as it should be, and it doesn't rise any special problems, for instance, new block, or maybe improving them. That's most of the work that's done there. Got it. Got it. Could you tell me a little bit about how Weights & Biases fits into your workflow, and how you use the Weights & Biases tool? Yeah. With Weights & Biases, we manage all of our experiments, which is great. We also use your visualization tools in order to compare between experiments. 
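A minimal sketch of the prediction-distribution check Nimrod mentions for monitoring without ground truth: compare recent model scores against a reference window with a population stability index. The beta-distributed scores and the 0.2 rule-of-thumb threshold are illustrative assumptions, not Nanit's actual monitoring setup.

```python
import numpy as np

def population_stability_index(reference, production, bins=10):
    """Compare two score distributions; larger values mean a bigger shift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip to avoid log(0) for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(0)
reference_scores = rng.beta(8, 2, size=5000)  # scores captured at deployment time
tonight_scores = rng.beta(6, 3, size=5000)    # hypothetical drifted production scores

psi = population_stability_index(reference_scores, tonight_scores)
print(f"PSI = {psi:.2f}")  # values above ~0.2 are a common "investigate" threshold
```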
Since you have everything so shiny and dynamic, we can also try different parameters and see what could have been without running the older model over and over again, which would save time. I'm a pretty huge fan of the reports that you can do because, as I said before, we are really tied up with the product team about the algorithms we do, which actually makes a way to show them what we do, and visualize on real time how each parameter affects the results. And we talk about what should be better for the product in the algorithm team together. So yeah, we used, tried a lot and... So, you actually use reports to share results with the product team? Yeah. We also use reports to summarize and share with product team, show them some maybe model weaknesses, whether we want to deal with this now, or maybe deal with this later. For example, how changing parameters can help. It's better for a mutual work and transparency because sometimes, you tend to be a little bit suspicious from things you don't understand, and once we understand their job and they understand our job, I think the mutual job is much better. We've seen that once you talk about it and you explain, and they can understand your world, and you can understand theirs, we can make decisions which are much more good for the company. So, it's actually pretty useful for us. Do you often go down paths where there's like a product feature you might want to make, but you're not sure if you're going to be able to make the machine learning algorithm accurate enough or powerful enough to actually make the feature possible? Do you ever get in situations like that? All the time. This is one of the main challenges we have when working with this scale and working on such sensitive data. I mean, we got such so many cools ideas and papers and works, and it's really hard to get them into production. This gap is sometimes pretty big. I can just name one example that pops into my head, GANs. GANs, for example, they're amazing example for that. They do marvelous things. But it's really hard to get them into production. I mean, they often tend not to converge, and it's worked well on this dataset, but not in this dataset. And these datasets work not good enough. So, it's a pretty big challenge how to be innovative and giving good and valuable features, but also, reliable and accurate, which is... What might you do it again, I'm trying to picture that. Like I don't want any deepfakes of my baby. No, no, no, not deepfakes but there are many other uses of GAN that we can use maybe for enhance images and make your nice fun features, that you can celebrate like your baby with a different background and stuff like that. So, the so-called... all sorts of stuff that GANs can be really useful, but again, there's a big gap between an experiment and paper, and actually getting into production. I mean, I know that in the last couple of years, there's been a lot of advances, almost like a tsunami of advances in computer vision. Have any of them been relevant to you? Do you take recent stuff and get them in production? Or is that stuff too kind of theoretical to really matter for the practical stuff you're doing? We always try, take state of the art and trying to adapt them to our domain, in our fields, which is easier. Mainly object detection, we talked about it, so it's... Since tasks are pretty much solved, let's say, or pretty much comfortable to get them into production. So, yeah, it's much easier. But there are other fields that we try. 
I honestly say we try all the time, sometimes, really hard to bridge this gap, but it's definitely something that keeps us motivated and try to do it all the time. I mean, if you stay behind in this field, you probably won't exist that long, this is what I do. Sure. Yeah. Is there, I guess, any paper or like line of research that you can talk about as being especially relevant to the work you're doing? I can talk about some nice researches we did lastly, and all of them are actually somehow related. I mean, they're all using the sleep metrics that we have, which have the algorithms at the back. For example, during the pandemic, during COVID, actually Nanit helped to kept families together. For example, when the grandparents can't see their grandchildren, and Nanit allows that. And we also checked, during the COVID, what are the effects on babies. And we actually is trying to study the difference between children, that their parents were essential and went to work as usual, and the parents that stayed at home. And we actually saw at the first few weeks like from end of March, let's say, for the first few weeks, we saw that the sleep of the babies actually got worse. Oh... Yeah. But it was actually improved after a couple of months. We saw that the sleep of the babies that are parents stayed at home actually got back to normal, which is pretty amazing. It's actually means that babies are resilient to the change and they adapt fast which is kind of cool. Can I ask you, so this is... I mean, this is like, I think, for a lot of parents, the most drama-filled topic is sleep training the baby where you leave the baby and let them cry at various lengths and teach them to go to sleep on their own, instead of with you holding them. Do you have an opinion on that? Well, since I'm not a sleep expert, I can only say, from my experience, it's important to let the baby sleep on their own, I guess, not in any cost, but... Do you have any data on that? I guess you do sort of track when the baby falls asleep on their own. Yeah. Yeah, we do. I'm not sure if I have any relevant research that we've done in this field, but again, this is the beauty of Nanit. I mean, you can actually test your assumptions, I would say, because if you believe in that, and then, the objective data tells you that it's right, so that's good. And if not, so, you might really want to reconsider, but that's up to you. I mean, you got the data, you can decide. Do you publish like aggregate statistics like that on different things that help babies sleep? We do have our researches that we publish. I'm not sure regarding those, what helps and what doesn't specifically. We did publish research about the screen time and how it affects babies and young children. And it's actually pretty amazing. We found out that, for example, touch screens have bigger effect on the sleep of babies, as opposed to, for example, television. I mean, television has less effect, which pretty amazed me. I mean, we sort of, touching our... causing fragmented sleep and less sleep time overall, which, it's really amazing. You can conduct a research and see it quickly, because we have large user base and engage users that can allow us and answer questions. This one is also a good research tool, I guess. That really is amazing. Yeah. And it seems like, I guess, from your app, I feel like your benchmarks of sleep are actually a little less sleep than I see in sort of like the parenting books that I read. 
Do you think because you're actually monitoring it, instead of getting self-reported data? Do you see systematic bias in the self-reported sleep data? Like it'll tell me like how my daughter is doing, like can I compare it to averages? And it's funny because the, the app is kind of telling me she's doing pretty good, but then, when I compare it to books that I'm reading, it seems like she's sleeping a little less than average. So maybe you're just trying to be positive and helpful, but I also wonder because we try to write down every time she wakes up and when she goes to sleep and when she gets up, and I always kind of feel like our written notes imply a little more sleep than the data actually shows us that she got. And so, I kind of wonder if previous studies are lying on kind of parents' memories, end up making us think that babies are sleeping more than they're actually sleeping. What I can say about it, I guess that's, sometimes, true. Also, I guess, getting data for babies for sleep... Especially from babies is really expensive. I mean, I'm not sure researchers can do thousands of babies, and then, record their sleep, what Nanit actually can do. So maybe there's some small portion. This is why you see some big variance between studies about sleep, I guess. I guess that would be the reason, this is my assumption. I guess, it there any other takeaways besides avoiding touchscreens to help a baby sleep? Any conclusions you've come to, with your large scale data collection? So, most of... actually, the significant tips that we see are actually incorporated in the app. So, helping baby fall asleep on his own is, of course, a remarkable sign for that. Because once he wakes up during the night, you can comes back to bed. And so, I guess what we see and what is... we're trying to translate it and validate it, of course, and send it as tips, if possible. Cool. Well, I guess we always end with two questions, and I want to make sure we have a little time for that. The second to last question is what is one underrated aspect of machine learning that you think people should pay more attention to than they do? I would say building a good process for deploying the models. I mean, making something that works as a system, and not occasionally work and not, because sometimes, people tends to, yeah, okay, let's take the data. Let's train it's. Okay, it's very good on accuracy. Okay, we can deploy it. And then, the performance are bad. And now, the model is in the air, and it's much harder to fix it. So, I'd say conducting this methodology, this pipeline of how to work better is something that people should pay more attention. And I think that's what we see, at least, what I read on Twitter and LinkedIn and stuff like that, people are paying more and more attention to that. And I think that's important for the industry. And are there tools that you use to help with that? In building those pipelines? So, we use whatever... For example, managing experiment and showing the report and see everything really helps us to get understanding on how it's exactly done. Try and simulate the production line, this is what works for us, but I know there are several companies and there are several products out there that can do many things. And this is why I wrote it as guidelines, because probably, some of the tips there, it could be useful for many people and some of them are not. Totally. And then, I guess maybe you answered my last question, but I'll ask it anyway. 
So, when you look at machine learning in general and making it work into production, what do you see as the biggest challenge from going from like research to deployed model working for customers? Yeah, as I said, I think this gap is, sometimes, is really big, it's fact. Maybe the ability to understand which paper is nice, but will it hold in production? It's a pretty big problem. You need to foresee it. And we've tried a lot of cool features that we saw in conferences and papers, but it didn't hold on our radar or maybe they weren't good enough. So, we had to drop them. Well, I really appreciate you being kind of public about your work and willing to do case studies and things like that. I think it really helps a lot of people learn best practices as they try to get models in production. So, we'll put some links to some of the work that you've put out, but I would say, please, keep doing it if you're open to it. It's super helpful for our community. Yeah, I totally agree. This is how we learned, and this is how we can share the knowledge. And I think as much as people will share the knowledge, it will be better and everyone could have great productivity, which I think is important. Totally. Thanks, Nimrod. Really appreciate it. Thank you so much. Thanks for listening to another episode of Gradient Dissent. Doing these interviews are a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to the episodes. So, if you wouldn't mind, leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also, if you wouldn't mind, liking and subscribing, I'd appreciate that a lot.",5827 +"Chris Mattmann — ML Applications on Earth, Mars, and Beyond",https://www.youtube.com/watch?v=RQMYwmnLufo,2522,2021-04-08,"In the next N years, we'll be building, in partnership with ESA, the fetch rover, which is more of a couple-tricycle-sized rover that has to drive farther and faster because it's going to have to go pick up all those tubules, make it to a rendezvous point, take those tubules, fly them up out of the Martian atmosphere, into space, to a spacecraft, and then take that spacecraft back to earth. Yes, that's ambitious, but we're NASA JPL. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. Chris Mattmann is Chief Technology and Innovation officer at NASA Jet Propulsion Laboratory. And he's the author of Machine Learning with TensorFlow, Second Edition. He recently worked on a number of space missions, including the Mars Rover that just landed on Mars. And I could not be more excited to talk to him about that. All right. I want to talk to you about your book and about your career, but I saw that you did some work on the recent NASA rover, or involved somehow. And I think probably everyone was watching that in the news and getting excited about it. So I was wondering if you could tell us what work you did and what it felt like to see that on Mars, and how machine learning could help with projects like that? Yeah. Lukas, I'm really interested in the new rover, it's called Perseverance. Successful landing, February 18th. Entry, descent, and landing. This is a Volkswagen-Bug-sized rover, very similar to the size of the 2012 Curiosity Rover or MSL, Mars Science Laboratory. So, it also necessitated the development of this new entry, descent and landing that we piloted in 2012, which is the sky crane. 
It's literally a robotic sort of craft that lowers this rover down on a crane onto the surface of Mars for a nice soft landing, and so on and so forth. So that was piloted again. It's only the second time that's been used. That was amazing. And obviously, with the 2020 rover, one of the cool parts about it is it's got this helicopter, this drone helicopter on it called Ingenuity. We do naming contests throughout the United States with kids in schools and ask them to name the rover and, in this case, name the helicopter, which is really cool. And I think it really symbolizes everybody's feelings during this pandemic, it's perseverance and also just humankind's ingenuity. But in terms of what we were involved with, there's a couple things. I'm the Chief Technology and Innovation Officer at NASA JPL. I run the artificial intelligence, analytics and innovation division. We're basically cross-cutting consultants. We do the AI practice, so we consult out to missions, projects and things like that, the cloud practice, and we also have some data visualization and infusion folks working on new ways of tech. There are a couple of different areas where we helped in 2020. The first was in a concept that we call Drive-By Science. It works like this, we were partnering with the Mars Surface Mobility group in a team led by Hiro Ono there. And basically, it works like this with Drive-By Science. So earth to Mars, 11 minutes at least, round trip light time, send a command there, get a message back. And so that's a very thin pipe. And so right now, Mars surface operations, even for the Curiosity Rover, and also for Perseverance, uses about 200 images a day to plan what to do the next day, because with a thin pipe, images are expensive, this and that. The other thing that's really important on these rovers is basically, I say these are elephant-sized vehicles with pea-sized brains, unfortunately. And there's a reason for that. They're running basically a RAD750, which is like an iPhone 1 processor. And why are they running such older technology? Well, cosmic radiation. When we put hardware up in space, cosmic radiation does wiggy stuff to the hardware. It flips bits from ones to zeros, zeros to ones. And so, we typically only fly things that are radiation hardened, which pushed us onto the technology low tick instead of the uptick for that. Tomorrow, we'll have high-performance spaceflight computing, featuring GPU-like processors that are radiation hardened. And we have some technology demonstrations of that today with things like the Snapdragon, which is on the helicopter, actually. It's running a Qualcomm Snapdragon. And so we can do that because it's a technology demonstration. It's not critical to the core mission of the rover, and things like that. And so in that future, when we have big brains on these rovers and assets and things like that, can we run deep learning on board instead of getting 200 images back, sending them across that thin pipe? What if we could give you back a million captions? What if we could run Google Show and Tell, or an adaptation of that using transfer learning like we've done, which is called SCOTI, for science terrain captioning? We name all of our stuff like Star Trek. And what if we're running SCOTI on board and we can give you one million captions back? So we call that Drive-By Science. And then another area I'll just mention and then I'll shut up, because I'd like to make this a conversation. Another area is what we call energy-aware optimal auto-navigation. 
And it's the same type of concept. It's looking out in the distance for the rover: in the imagery, if it sees sand, it knows those wheels aren't going to catch as well on it and it's going to use more power. If it sees rocky terrain, the wheels are going to catch better and it's going to use less power. So, looking at energy-aware optimal auto-navigation using a similar concept. Those are the big things we've been working on. That's really interesting. So, do you do any machine learning now on the rover? Is that even possible with the hardware you have? And if you have a Snapdragon on the helicopter, it seems like you could do some on that, or try to do some. So is there any happening, or is it mostly older techniques for now? Yeah. A lot of it is human in the loop, but there are some elements of autonomy, for example in terrain classification. We have been doing a lot of work to take newer, modern algorithms. The interesting part is DevOps at the edge, where the edge is Mars. We talk about the edge today in the cloud or in IoT. So it's DevOps. What you test terrestrially, you've got to make sure we can uplink it and port it to, again, these older devices, and in some cases, devices that were deployed almost eight years ago, like Curiosity and things like that. And so we have been working on that. There is an algorithm called SPOC, again Star Trek names, but this is a soil property and object classifier. It's like a terrain classifier. And we can run that on the older devices. Obviously, the tricks with that are, you don't have a GPU, you may have to quantize the models, trade accuracy for performance and things like that within an acceptable balance. And so a lot of these things are for human subject matter expert review or for mission tactical ops review with a human in the loop. The more in the future we can get that out of the loop and make more autonomous decisions, well, we're going to need it. And I'll give one quick example. The next mission in the program is called Mars Sample Return. And the basic idea is this: this big car-sized rover driving around, Perseverance, one of the things it does is it's coring rocks, and it's going to drop tubules of those cored rocks as it drives over the next N years. In the next N years, we'll be building, in partnership with ESA, the fetch rover, which is more of a couple-tricycle-sized rover that has to drive farther and faster because it's going to have to go pick up all those tubules, make it to a rendezvous point, take those tubules, fly them up out of the Martian atmosphere, into space, to a spacecraft, and then take that spacecraft back to earth. Yes, that's ambitious, but we're NASA JPL. But for that whole thing, you need, obviously, more increased autonomy with that 11 minute light time, as well as charging and all the other stuff that we've got to do on the rovers to basically make sure that they can operate successfully. Can you push for more updates at all with the current stuff you have? Dive into that for me, like updates in where? I guess, could you update the software from earth, or is it like once it's in there, is it set forever? Oh yeah. They do update the software from earth, but there are windows for doing that. There are times in the mission life cycle when that's acceptable risk, or they'll allow us to do that, or things like that. And then there are times when they won't, obviously, during critical mission operations or associated with some science events or things like that. 
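A rough sketch of the quantization trade-off Chris mentions for constrained flight hardware, using TensorFlow Lite post-training quantization on a stand-in model; the tiny network below is purely illustrative and has nothing to do with SPOC or any actual flight software.

```python
import tensorflow as tf

# Stand-in terrain classifier: a tiny CNN over 64x64 patches with 4 classes.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4, activation="softmax"),
])

# Post-training quantization shrinks the weights (e.g. float32 to int8) at
# some cost in accuracy, which then has to be re-measured against the original.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()
print(f"quantized model size: {len(tflite_bytes)} bytes")
```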
And so, it really depends on the mission life cycle, but we do have the capability to uplink and even to update things. In the past, those mostly have gone well, but sometimes there've been issues with updating, and so they're very reticent to do that a lot. But we do have these technology opportunities to update the assets out in space. And sometimes they even compete them. They'll issue a proposal solicitation and get the best ideas and then do stuff like that. Cool. So I guess outside of the rover, what other ML projects are going on at the JPL right now? Yeah. There's a lot. I like to talk about it in different pocket areas of ML and AI. One of the areas is cybersecurity. We look at signals, we do analysis with like Data Lake or Delta Lake type of partners, where we're getting signals from cyber and they're doing anomalies and stuff like that, detection. Another that's really driven by computer vision is what we talked about, it's the Mars surface, but not just Mars, future lunar missions, small sets, cube sets, what can we do with imagery and computer vision? Another is basically what we call science planning and scheduling. Basically, the ideas there is like, ""Okay, we've got these football-stadium-sized dishes in Madrid, Spain, Canberra, Australia, Goldstone, California. We call that the Deep Space Network. You can imagine these things, they're not just supporting the United States, they're supporting all of our international partners for missions. Everybody must use the DSN because they're just this international asset, this world asset, to communicate in deep space and not everyone has built such infrastructure. And so in any given week, the DSN is massively oversubscribed because missions know possibly months, possibly years ahead of time what their critical events are and when they need tracks on the DSN to track things. And so you can imagine this very difficult scheduling problem, 80% of which can be solved by traditional AI scheduling and planning. But the last 20% of it basically boils down to managers getting into a room and horse trading. So basically we've been doing a lot of work to learn what those trades are using like deep reinforcement learning, experimenting with quantum computing, looking at Mixed Integer Linear Programming, or MILP, traditional ways of doing that. And doing that in ways where we can apply ML to actually do a couple of things, learn what those moves that the mission managers make, because whenever you ask them, they don't tell you, because to be honest, it's just innate to them. It's like, ""Well, of course I didn't really need six hours on the track on that dish for my mission, we could have lived with four."" ""Okay. Well, why didn't you tell someone that?"" ""Well, because you always ask for more than what you can get."" And so these types of things. And so the agent has to learn that and they've got to learn how to generate optimal candidate schedules that fulfill like 46 other constraints and other things. So that's another big area. And then finally, I'd be remiss if I didn't mention there's a ton of ML just in science processing related to science data and instruments. And JPL has a whole science and instruments section and division, that their job is to basically get data off the instrument, do analytics, generate data products, build maps, build decision products, all of these things, help science research. 
ML is at the cusp of what I would call massive infusion in those areas, lots of experimentation going on and they're at that crux of turning it into ML ops. And that's basically where it is besides IT and business, where we also are doing it with RPA and some of these other areas. The instrument use case sounded really interesting. Can you give me some concrete examples of that? I'm just totally unfamiliar with that whole space. Yeah. So imagine JPL minting first of a kind instruments, because that's what we do in earth sciences, space sciences and planetary science. And there's a reason for that, the national labs are supposed to do that work that no other whatever commercial industry or traditional civil servant places can do. And then once we do it, we're supposed to transition then into industry and stuff. And that is very much true for hyperspectral, where the field actually was mainly defined in some ways at JPL, by people like Rob Green and the AVIRIS Spectrometer and things like that, but also in other areas, LIDAR and things like that. And so in these instruments, there are all sorts of things, the traditional model for missions at JPL is a phased life cycle where pre-phase A and phase A like formulation, phase B is like actual, real costing, phase C is where you're building the mission out and you're actually building it. Phase D is like mission launch and whatever. And E is like standard operations. And so associated with that life cycle at each stage and in particular, in phases D and E, besides delivering the mission bits from the instruments, the science data, the engineering data and stuff like that, NASA competes out, typically. It does a couple of things. It does some directed work, but it also does some competition for basically analytics and ML and things like that on the analysis. But even during the mission life cycle phase, it's basically, go from voltages, which are basically electrical signals that have measurements buried into them, to geo-calibrated radiances. So radiances is calibrated to say some space on the earth where you got to map it using orbital parameters, to basically a full physical model, in some cases to extract out from those calibrated, geo calibrated, geo-referenced radiancee what the hell it was measuring. And that in some cases it's called level two data, and there's a massive amount of it. And in that, in some cases, even in the mission is where some missions stop, and then they compete out the level three, level four product generation, which is basically taking those swaths of instrument actual measurements and mapping it to a geo globally gridded grid, and then doing other stuff and maybe combining other products on it. And so, even in the mission production life cycle, there's an opportunity for ML. Some people are looking, they say, ""Well, can I replace my full physics model with taking geo calibrated radiances and then mapping them to a map or even to values and measurements, can I say, build a neural network to do that? Can I learn a representation of something with an auto encoder? Can I do, in concept-wise, like a regression or like even a network or CNN to like do value predictions?"" And stuff like that. And there's a lot of experimentation during the science mission operations now for doing that. 
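A toy sketch of the idea he raises of swapping a learned regressor in for part of the physical retrieval: a small network mapping calibrated radiance spectra to one retrieved quantity. The spectra and target values are synthetic stand-ins, not any mission's actual data or pipeline.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 1,000 "spectra" of 200 calibrated radiance channels,
# plus a scalar quantity we pretend the full physics model would retrieve.
rng = np.random.default_rng(42)
radiances = rng.normal(size=(1000, 200)).astype("float32")
retrieved_value = radiances[:, :10].mean(axis=1, keepdims=True)  # fake "truth"

# Small regression network standing in for the physical retrieval step.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(200,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(radiances, retrieved_value, epochs=5, batch_size=32, verbose=0)
```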
Because obviously, ML has the opportunity to cost much less physically, to not require supercomputers or other specialized computers to do some of these things, but to run on more commercially available compute, from the cloud and GPUs, TPUs and things like that. And finally... Well, go ahead. You were going to say something. Oh, I was just wondering, when you're saying compete out these steps of the process, is that something where, like, if I want to try to build a model to do this mapping, I could go to a website and get involved? Is this like Kaggle, or is this like... How does that actually work? This is fabulous. The answer is no today, and there's a reason for it. Actually, this will parlay into the other thing I was going to say. So basically, you get the level two data out. It's massive, it's petabytes in some cases. You always say people want the level two data, but you really don't want the level two data. And in some cases, it's so big that there may or may not be a requirement to preserve it, because NASA, or even other agencies, NOAA, may have made the decision to basically say, well, we could always reproduce this using a big reprocessing campaign. What's the minimum bits and level products that we need to store and keep around, because there are preservation requirements? And they always ask that question. And so the answer related to that, to your question, is that, again, you always want the level two products, and then you don't, because it's too big. Okay. So what are those level two products stored in? They're HDF5, HDF4 with HDF-EOS metadata, they're NetCDF products, GRIB. There's probably a half a dozen archival formats that aren't, say, machine learning ready, like one big table that has everything, or a multi-dimensional table that SciPy, NumPy can work with. But there's been massive work in the Python community and other places to integrate that stuff with... So, believe it or not, HDF5 is very popular in machine learning, for weights and classes and all the work Francis and others did in some of these things. Where did HDF5 come from? It came from NASA and earth science, actually. It was an earth science archival format from investment from NASA, NOAA and the EPA in the HDF Group, which spun out as a separate organization to do it. So actually, the storing of matrices, scalars, vectors, named hierarchical representations of them, actually came from representing earth science data. The challenge is, what you get at that level, again, it's not globally gridded. Some people don't know what to do. Like, if you can't take a point in a coordinate reference system and get a value, their heads explode sometimes, because they don't understand that these satellites generate these weird U-shaped orbital swaths where the data is only valid at certain times. Everyone just assumes you can interrogate something and say, ""Give me the data and do machine learning."" But there's so much processing and level processing that you have to do beyond that. And so what do people look at? And this was the second thing I was going to talk about. There are also big opportunities. So the science mission sometimes stops at level two data production, but I'm even saying there's ML opportunity there, and people are looking at it. But even in the archival, you go to earth science, you go to the DAACs, which are called the Distributed Active Archive Centers. These are for earth science, nine places across the US where you can go get data, and you can download it today. 
But again, does it stop it, the level two products? Does it stop at the level three? Once you get to level three, we're talking about 100 time data reduction too, because these maps take less because they're interpolated or they're globally averaged, not specific interrogatable values out of points. Chris, I feel like this is obvious to you, but I want to have a concrete picture in my head of what one of these datasets is and then what the level two version is and level three, and where it actually goes. Just one where I can really picture it. Yeah. Let's take OCO, which is the Orbiting Carbon Observatory. And it produces a value called XCO2. Which is a CO2 sources and sinks. It's a column-based measurement of CO2 in the atmosphere. The level two products for OCO, it's actually called the OCO2 mission. What they look like is kind of like a U-shaped, upside down Bell curve. Say you take a world map and you projected out in our Cartesian space, so you flatten it out to two dimensions, how you see in those maps. And then if you look at the data, just based on how the satellite orbits, first off, there's little data over water, and it's like an upside down U curve. It's almost like a sine wave, so it's like this. That's the level two data, because the way the satellite is orbiting and when it turns on and when it doesn't to get measurements, because some of these things can't see through clouds or whatever, ends up being this track. That in many cases, independent of the instrument, spectrometer, radar or whatever, just say in the OCO2 case, is a level two data product. And it might represent, I don't know, depending on the orbit life cycle, 14 days or something like that before you can get the full track across that. It depends on how long it takes to orbit. So just to make sure I understand. So the satellite's looking down at the earth and measuring the amount of CO2 with some spectral camera or something, and then it's downloading to you the amounts of CO2, but it's only along the like weird linear track that just like its orbit over the earth? But then that track, I guess, is also moving, so you get different measurements in different places on the earth? Well, that's exactly right. And in your mind, you imagine, ""Oh, we fly a mission like OCO2 and it covers geo-global grid. I ought to be able to go to Russia or Africa, or the United States at any time in any grid cell and get a value, right?"" Well, at level two, you can't, because at level two, you only have values where that satellite saw a data in its orbit. Now, at level three, what people do is they take that Bell curve, the upside down one, and they basically interpolate, or they average it so that you get values in neighboring cells, and you color the cells. It's almost in machine learning terms like a self-organizing map. So you go from basically like a sign wave to a self-organizing map, geo globally grid, where you can interrogate the values at any point in time. And this is the scientific process. And is that an official level two to level three mapping. How do you compare if two people did different mappings which one is the best one? Yeah. There you go. And that's a great point, and usually it's controlled by the science team, there are some standards in instrument families, so there's a way to do this typically with spectrometers, there's a way to do it with radars. And then there are other requirements like, what's the precision needed? What type of data density do we need at a particular pixel? What's the resolution? 
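A minimal sketch of the level-two-to-level-three step he's describing: take sparse swath samples of latitude, longitude, and XCO2 and average them into a coarse global grid, leaving cells the orbit never saw empty. The values are synthetic, and real level-three production involves far more careful interpolation and quality filtering.

```python
import numpy as np

# Synthetic level-two-style swath: sparse samples along an orbit track.
rng = np.random.default_rng(7)
lats = rng.uniform(-60, 60, size=2000)
lons = rng.uniform(-180, 180, size=2000)
xco2 = 410 + rng.normal(0, 1.5, size=2000)  # ppm, made-up values

# Level-three-style product: average samples into a 5-degree lat/lon grid.
lat_edges = np.arange(-90, 91, 5)
lon_edges = np.arange(-180, 181, 5)
sums, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges], weights=xco2)
counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])

gridded = np.full(sums.shape, np.nan)
np.divide(sums, counts, out=gridded, where=counts > 0)  # NaN where the orbit saw nothing
print(gridded.shape)  # a 36 x 72 grid, many cells still empty
```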
And those are dollars and costs, because they translate to mass and power in the instrument, and they also translate to processing time and whatever afterwards, when you get the data. So these are all little knobs that mission managers, science teams, the science requirements for the mission, trade. But I just want to understand, the part that you're saying you compete out, is it the mapping from level two to level three here, going from the raw satellite data to the earth data, or is there a further mapping where most of the machine learning happens? Oh yeah. Great. By definition, if you look at these NASA missions in earth science, and it's very true in planetary too, the archives again, and remember the size of level two is, again, 100 times bigger in many cases than level three and level four, the dollars they spend, dollar per bit in preservation, they've got to cut off at some level because they don't have infinite money, but they're supposed to preserve this ""forever"" in some cases. And so we're talking hundreds of millions of US dollars of investment in these archives just to keep the bits around. And so a lot of the archive systems will say, or some missions will say, ""We're only distributing up to the level two products."" Some of them will distribute level three, but they'll have different rolling windows of how long they'll be available using their standard algorithms and things like that. Like, maybe they don't keep the level three around forever. So now what does NASA do, like I was saying, to compete and to do analytics and really raise the tide in ML and some of these things? Well, what they'll do is they'll say, ""Okay, in a particular earth science area,"" or whatever, they'll say, ""Well, we have a number of recurring programs,"" and NASA has this thing called Research Opportunities in Space and Earth Sciences, or ROSES. These are programs in which they release 40 different solicitations or whatever, where you basically write a proposal and compete against other NASA centers or universities or commercial industry to basically do higher-order processing, to generate maybe improved level products at the level three, level four levels, maybe ones that cost less, that are more accurate, that didn't take as long to do the algorithm for, or as much scientific expertise or knowledge. And those are all the knobs that they'll do that on. It doesn't mean that such algorithms automatically get put into standard level processing or into the archives, but there's the opportunity to do it. And then, and this is the beauty, NASA doesn't have to control everything, neither does NOAA, whatever. This creates a market and an opportunity downstream for universities, commercial partners, or whoever, to build better products. And if they do, these could become the standards eventually, and NASA is very happy to do it because they still fulfilled their mission of researching, observing, and making those data products free to the world. Sorry to cut you off, but I was really interested in the... You keep saying competed out. Are you saying that the reason that I couldn't try to go build a better mapping and give it to NASA is because the data's so big, it would be hard to get, or is there some other bottleneck there? 
Well, it's like a combination of both, you might be able to do it because you're Lukas, superpower, have access to massive cloud, whatever, but it's harder for a postdoc or somebody, a K-12 or someone at university undergraduate to be able to get the type of access to basically do this. And it's also part of just the way science occurs. There's this movement as you and I know, so in the context of like ML to the MLOps, lots of people still use Jupyter, I still do everything locally in some cases, they'll just do Jupyter locally to do stuff. But then there is this movement, JupyterHub, but even beyond that, getting stuff out of Jupyter, Python MLOps, frameworks, TensorFlow, PyTorch, whatever. There's this whole movement again, from the science research for long tail to doing it in a team with DevOps, with all these things, good software engineering, producterising, and things like that. The exact same thing exists in the context of science research in fact, many scientists would much rather love pull all the data down to their laptop and crank on it with MV-IDL, or MATLAB, or even Python, because that's what they've learned in atmospheric science. And so there's almost this mentality or paradigm shift that is even undergoing there too. So that's another part in it, Lukas. And then finally, the last thing I would say is that you also have to basically how... And some of this stuff is self-documenting and some of it isn't, but what assumptions were made at the level two and before error to get to level two, because you've already started potentially to propagate some error bars, even to get to level two, to go from geo-located, physically calibrated radiances to a level-two product, you've already made some assumptions. And so NASA does document those, those are called algorithm, theoretical basis documents, and they do make them available, but you also need to dive into some of those to know how to then apply ML to go beyond. Got it. So it's really just a really hard problem, it's not that there's any resistance to letting people try it, I guess. Totally. I'd say there's welcomeness for people trying it, it's just a really hard problem. You nailed it. I'm a little shy of asking this question, and maybe it goes nowhere, but I just want to, because me and my co-founder all love this game called the Kerbal Space Program and we all play it. I wonder if people at JPL are aware of this and if people coming in, I feel like all my instincts about rockets come from this one video game that I got obsessed with a few years ago, does this come up at all? Not with me. I'll say a lot of JPLers are playing among us right now, but that's not Kerbal Space Program. Tell me about it. Tell me, what is it? Oh man, it's this amazing game where it's very... I think what's fun is it's very self-directed, there's not really a clear goal, but you basically try to build rockets and put them into orbit. I feel like I learned a lot of just how complicated it is to like make a satellite and then try to get a rocket to that satellite and the trade-offs, like you want a lot of thrust at the beginning, but then that could be an inefficient engine. And you realize actually going to a planet with a high gravity and coming back is way, way harder than just orbiting it, for example. So I don't know, I was curious if this... 
Because I feel like when I talked to people at places like NASA, I have all these specific questions and then sometimes they're just like, ""Oh my God, you must've played that Kerbal Space Program game."" That's awesome. No, well, I'll bring it up. So for your audience and also for you, this isn't always clear and we even did it ourselves just now. I work at NASA, yes, but I'm at the Jet Propulsion Laboratory, which is one of the nine NASA centers. And so typically the NASA centers have different expertises or main things that they do. The big thing that JPL does, in Pasadena, California, amongst all the other nine NASA centers, is that we run autonomous exploration. We do a lot of autonomous exploration. In fact, we're a center of excellence and NASA's only federally funded research and development center for that, a national lab, and the Mars Program. But we don't do very much at all, very little, of human space flight. A lot of the other NASA centers like Marshall, or Johnson, or Kennedy, like mission control or launch pads, that's where you see a lot of that, but our expertise typically is in... So if it's robots and it's deep space, that's usually us. Well, just for the record, that game has produced a lot of incredible space programmers. Well, okay. That's it. I'm going to tell my whole team to play this now, Lukas. So we're there. We'll move from Among Us to that. I really recommend it. I also wanted to ask you, because this is so impressive, you also wrote a book in your spare time while doing your day job. I was curious what inspired you to write this book. That's great. What inspired you to write this book? For me, it was almost an appreciation, and an under-appreciation in a way, of the evolution of the field of machine learning. For me, a few years ago, actually it happened right before the pandemic, I would talk to brilliant men like you or people on my team, and then they'd be talking about all this machine learning stuff and frameworks, and I'd be like, ""Yeah, that's interesting."" Heck, like 10, 15 years ago, let's see, well, it's 13 years ago now, I graduated with a PhD. I did a minimal amount of maybe what you would call machine learning today. K-means clustering in my dissertation or whatever. I did a little bit of Bayesian inference, but I would say the field of machine learning wasn't back then like it is today. So I heard everyone talking about this stuff, and I don't need to know stuff materially at that deep level anymore as much, but I said, ""Let me go pick up a book."" I had written a book about... well, it was in 2010, it was called Tika in Action. It was about the Tika framework, long story short on that one. It's basically, we call it a digital babel fish. If you read Hitchhiker's Guide to the Galaxy: any language, you put the babel fish in your ear and out the other end, it's your interpretation. You can understand it. Tika is that for files. You give it any file, it tells you the file type automatically, the MIME type, it extracts the text, the metadata and the language. And basically for your audience, all they need to know about it is, look it up on Wikipedia, it was the key technology to solve the Panama Papers and win the Pulitzer Prize.
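For readers who want to see the digital babel fish idea in practice, here is a minimal sketch using the tika-python bindings for Apache Tika; this is only an illustration, it assumes Java is available since the library starts a local Tika server, and the file name is hypothetical:

```python
# pip install tika
from tika import parser

# Hand Tika an arbitrary file; it detects the MIME type and pulls out
# the text and metadata regardless of the underlying format.
parsed = parser.from_file('mystery_document.pdf')

print(parsed['metadata'].get('Content-Type'))  # detected MIME type
print(parsed['content'][:500])                 # first 500 characters of extracted text
```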
So I wrote that book in 2010, and I get Manning books sometimes where I talk to them and I can get a book. And there was a book that came out called Machine Learning with TensorFlow. And I said, ""Hey, can I get this book? I want to read it. I want to learn machine learning so I know what the hell my people are talking about."" And so I started reading it and I pulled out a pencil, I started drawing matrices, I started really trying to, instead of just reading it, do it, do the exercises in the book. And so what I arrived at after nine months, and this was the lead-up to 2020 in the pre-pandemic world, what I arrived at was probably 50 Jupyter notebooks. Everywhere there was a thrown-out suggestion in the first edition of the book, like, ""Hey, you could try and build a facial identification system,"" I did it. I rebuilt the VGG Face model. I had a publication at Supercomputing. And I basically just had code, Jupyter notebooks and everything, and I was like, ""I've got a second edition of this book,"" because I filled in all the gaps and I added a bunch of new chapters. And so I pitched it to Manning, they loved it, and I was off and running. And that's how I did the Machine Learning with TensorFlow Second Edition book. And so that's how I got there. That's so awesome. And I saw you made an interesting choice to use TensorFlow v1 in the book, instead of v2, because it's still in use on supercomputers, is that right? Or what was the thinking there? Yeah. And v2, for me, I never shied away from this, you know what I mean? Heck, I was on the board of Apache, we were maintaining, what is it, a 25-year-old web server or whatever, so it's not the oldness of the technology for me, part of it was stability. And so what I was finding was that TensorFlow 2 was changing a lot at the time. And I made the decision in the beginning, I said, ""We're going to pin it to 1.15 because that's stable,"" or whatever. And it was still in use at that big supercomputing agency because they hadn't been on the technology uptick and I was writing it a little bit for them. What we ended up doing, and what I promised to Manning, is that during the book, about midway through or at the end, I would take a look at basically porting every example, every notebook in the book to TensorFlow 2. And that's exactly what we did. And let me tell you something, in a testament to Google, and I give them credit for this, it took us about two weeks to port the entire book to TensorFlow 2, myself and a couple of students and folks who literally just donated their time. And so this wasn't a massive undertaking. There was a big paradigm shift mentally from TensorFlow 1 to TensorFlow 2, but I would say 85% of the code from those notebooks is the same because it's all data preparation, making it analytics- and machine-learning-ready, and then doing ROC analysis, receiver operating characteristic, area under the curve. None of that stuff, the beautiful libraries in Python, Matplotlib, NumPy, SciPy, Pandas, all of these things, they didn't change. What changed on the inside was basically how you set up the model for training, how you run the training step, how you update your gradients if it's a neural network, all of these things. That part changed, but the other parts didn't. Scott Penberthy, the head of AI, or applied AI, at Google, basically made this remark in the intro to my book: ""Don't worry about tracking the latest and greatest XYZ API update, these models and the way you build them will stand the test of time,"" and I agree with him. Cool.
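To make the porting effort he describes concrete, here is a rough sketch of the part that did change: the same toy linear-regression training step written in the TensorFlow 1 graph-and-session style and then in the TensorFlow 2 eager style. The data and model here are made up purely for illustration, and the two snippets would live in separate scripts since the first one disables eager execution.

```python
import numpy as np
import tensorflow as tf

xs = np.random.rand(100, 1).astype('float32')
ys = 3.0 * xs + 1.0

# TensorFlow 1 style: build a static graph, then run it inside a session.
tf1 = tf.compat.v1
tf1.disable_eager_execution()
x = tf1.placeholder(tf.float32, [None, 1])
y = tf1.placeholder(tf.float32, [None, 1])
w = tf1.Variable(tf1.zeros([1, 1]))
b = tf1.Variable(tf1.zeros([1]))
loss = tf1.reduce_mean(tf1.square(tf1.matmul(x, w) + b - y))
train_op = tf1.train.GradientDescentOptimizer(0.1).minimize(loss)
with tf1.Session() as sess:
    sess.run(tf1.global_variables_initializer())
    for _ in range(200):
        sess.run(train_op, feed_dict={x: xs, y: ys})
```

```python
import numpy as np
import tensorflow as tf

# TensorFlow 2 style: eager execution with GradientTape, no sessions.
xs = tf.constant(np.random.rand(100, 1).astype('float32'))
ys = 3.0 * xs + 1.0

w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
opt = tf.keras.optimizers.SGD(learning_rate=0.1)

for _ in range(200):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(tf.matmul(xs, w) + b - ys))
    grads = tape.gradient(loss, [w, b])
    opt.apply_gradients(zip(grads, [w, b]))
```

The data loading, plotting, and evaluation code around a step like this is identical in both versions, which is consistent with the estimate that most of the notebooks carried over unchanged.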
I guess another topic that I think you've been thinking about a lot and you brought up earlier is MLOps and open source. Are there any projects you're particularly excited about for helping just deploy and maintain machine learning models? Are there any that you use at the JPL? There's a couple. I'm impressed with a lot of the Amazonian things that are coming on, SageMaker Studio and some of the things they're doing. I'm impressed by some of the capabilities that W&B is building, your company, and I'm not just saying that, too; there seems to be some appetite really to... And for me, my exposure to this was through the DARPA Data-Driven Discovery of Models program, but really looking at parameters, parameter tuning in an automated way, optimizations of that pipeline, and also SME-based, subject-matter-expert-based model feedback. And I really think that's where the world's going to go soon, in the next, if you look at one, three, five-year timescale. AutoML is here, if you look at some of the capabilities, Google AutoML, DataRobot, some of these other things. So really it's going to shift what people are doing, I think, from building models all the time, to letting a computer put together primitives and score them for you, and whatever. And then start there, give that feedback, change the job that that person is doing to giving that feedback. And I think it'll make people more optimized and things like that. Why do you think that hasn't happened already? AutoML has been out for a long time, hyperparameter optimization has been around for quite a long time and good libraries exist, including Weights & Biases has a library that folks use, but it doesn't seem like, I would say, maybe 20% of our users look like they're doing hyperparameter optimization with our stuff or somebody else's stuff. Why do you think it's not more widespread already? For me, it's a little bit of... in 2018, I went to a blockchain conference at UCLA at the Blockchain Lab, and I sat there and I listened to Ethereum versus, oh God, I forget the other one. Well, obviously there was Bitcoin and then there was EOS and this and that. And I was like, ""Oh God, these are the early days that are like the IETF, the Internet Engineering Task Force, where everyone was trying to build their specifications of Gopher and this and that."" It feels like it's still the Wild West, and there's always de jure competition, which is standards-based, getting people together and saying, ""This is what thou shall use, thou shall use MapReduce, or thou shall use whatever."" And then there's the de facto side of that, which is, what are people actually using in what they're building? And it feels like the people that are doing the de facto development and that type of uptick in terms of framework things aren't meeting the people that are making the framework decisions for the ops and what they're going to double down and invest in. And the closer that does move, and it will happen, I really do believe it'll happen. The pandemic accelerated that in a lot of ways, I think, but I think when those two things move closer, Lukas, I think you'll see it. Well, that's a good segue, it's two questions that we always end with, and the second-to-last question, which maybe you've answered already, but maybe you could answer with something else, is: what's the underrated aspect of machine learning that you feel like people should pay more attention to than they do today? Actually, I won't pick... I have two choices for that. One thing I could have picked was learning with less labels. That's far out there, the zero-shot, one-shot learning. I'll just say, pay attention to that, but the soundbite here for me is ML at the edge.
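As a concrete reference point for the ML-at-the-edge theme, here is roughly what the seemingly push-button part of edge deployment looks like with TensorFlow Lite. The model here is a made-up toy, and on a Jetson-class device you would more likely target TensorRT instead, but the conversion step is the easy part either way:

```python
import tensorflow as tf

# A stand-in for a trained model; in practice this would be loaded from disk.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1),
])

# Convert for an edge runtime, with default post-training optimizations.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_bytes = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_bytes)
# The hard part is that accuracy and latency on the target hardware can
# differ a lot from what you measured on the workstation.
```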
Everyone thinks you can take a machine learning model, put it onto an NVIDIA TX2 or Jetson, and your model is going to perform the same way and it's going to be push button. And that's just bupkis, it doesn't work like that, there's so much engineering involved, you're trading so much at the model. But look, if you look at CES, we're going to move from per capita devices, four to nine right now for people over the next five years to 40, 50, 60, 80 devices per capita. So these are all going to be running machine learning, ML at the edge. Do you want to know and be involved in what's happening there? If you do, get involved, and also realize it's not push-button model deployment and your models perform a lot differently. And so that's an underappreciated area in my mind, and I want all your smart people, and you, and your audience to focus on that because we need help. It's funny, that answers the final question that we always ask, which is what is the biggest challenge making machine learning work in the real world or at your job? Is it actually edge deployment and models behaving differently when they're actually in the edge? Is that a fair summary? Yeah, absolutely. That's a totally fair summary, Lukas. And also where the edge, we definitely do a lot of IoT on campus in clean rooms and other places, and all of this, but also where the edge is Mars. Do you have any tricks to leave us with in making things work on the edge? Is there any best practices that you've figured out to help with that? Yeah. One best practice I'll just share with you, and this is our biggest time sink, is don't change the hardware midstream. A different thing that looks compatible is actually vastly different, even within the same product family. So stick to what you got and the computing power you have, and engineer more optimizations there rather than thinking, oh, it's just... because the price point of the hardware, it makes it so attractive, ""Oh, I'll just spend another 200 bucks and get a different thing."" No, no, no, you'll spend 10 to 50 times that re-engineering your entire pipeline. So don't change the hardware midstream. Awesome. You've spoken like a real engineer. Good note to end on. Thanks, Chris. Thank you. It's good to chat with you. Back at you. Thanks, Lukas. Thanks for listening to another episode of Gradient Dissent. Doing these interviews are a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",7564 +Vladlen Koltun — The Power of Simulation and Abstraction,https://www.youtube.com/watch?v=htdsPSgbLQo,2968,2021-04-01,"I wanted to understand how we train intelligent agents that have this kind of embodied intelligence that you see in us and other animals. Where we can walk through an environment gracefully, deliberately, we can get to where we want to go, we can engage with the environment, if we need to rearrange it, we rearrange it. We clearly act spatially intelligently, and by intelligently in an embodied way. And this seems very important to me. And I want to understand it, because I think this underlies other kinds of intelligence as well. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. 
Vladlen Koltun is the chief scientist for Intelligent Systems at Intel, where he runs a lab of researchers working on computer vision, robotics, and mapping simulations to reality. Today, we're going to talk about drones, four legged robots and a whole bunch of cool stuff. All right, Vladlen, thanks so much for talking with us. I saw your title, it's somewhat evocative. It's the chief scientist for Intelligent Systems at Intel. Can you say a little bit about what the scope of that is? It sounds intriguing. Yeah, I prefer the term Intelligent Systems to AI. AI is a very loaded term with a very long history, a lot of baggage. As you may remember, the term fell out of favor for a very long time, because AI over promised and under delivered in the 80s, and 90s. And when I became active in the field, when I really learned quite a bit about AI, the term AI was not used by many of the most serious people in the field. People avoid the term artificial intelligence, people identified primarily as machine learning researchers. And that persisted into I'd say, the mid the 2010s actually. It's only very recently that the term AI became respectable again. And serious researchers on a large scale has started to identify themselves as artificial intelligence researchers. I somehow find that term Intelligent Systems broader. First of all, because it doesn't have the word artificial. So if we're interested in Intelligent Systems, we clearly are interested in artificial intelligent systems, but also natural intelligent systems. We want to understand the nature of intelligence, we are concerned with intelligence, understanding it and producing it and using it in systems that we create. It's a more neutral term with less baggage, I like it. I don't mind AI, but somehow I'm more predisposed to Intelligent Systems. Cool. I love it. And I always try to take the perspective of these as someone who knows about machine learning or Intelligent Systems, who maybe isn't an expert in your field, which will be super easy in this interview, because I know very little about robotics and a lot of the stuff that you've been working on. But I am very intrigued by it. And I think anyone kind of understand how cool this stuff is. So I'd love to ask you about some of the papers that I was looking at. I mean, one that kind of just stuck out to my, well, to myself now, but also my younger self is just like, unbelievably cool, was the paper that you wrote in quadruped locomotion, where you have like a walking robot, navigating terrain. And I think what was maybe most evocative about it was, you say that you basically train them completely in simulation, And so then it's sort of like zero shot learning in new terrain. And I guess, could you say for someone like me actually, who's not an expert in the field, kind of, what's like hard about this? Like, just in general, and then kind of what did your paper offer that was sort of new to this challenge? Yeah, legged locomotion is very hard, because you need to coordinate the actuation of many actuators. And there is one very visceral way to understand how hard it is. Which is to control an animated character with simple legs, where you need to actuate their different joints or their different muscles with different keys on the keyboard. And there are games like this, and you can try doing this even with just four joints. So try actuating four joints yourself, and it's basically impossible. It's just brutally brutally hard. 
It's this delicate dance where at the same time, in synchrony, different muscles need to fire just right and one is firing more and more strongly and the other needs to subside and this needs to be coordinated, this is the precise trajectory in a very high dimensional space. This is hard to learn, and if you look at human toddlers learning it, it takes them a good couple of years to learn it. This is even for human intelligence, which is awesome. And I use the term awesome here in this original meaning, I don't mean awesome, like a really good cup of coffee. I mean awesome, right? Even for this level of intelligence, it takes a couple of years of experience to get a hang of legged locomotion. So this is very, very hard, and we want our systems to discover this, to master this delicate dance, that as adult humans, we basically take for granted. And you can look at basically the most successful, I would say attempt so far, which is Boston Dynamics. Which is a group of incredibly smart, incredibly dedicated, insightful engineers who're some of the best in the world at this, a large group, and it took them 30 years. It took them 30 years to really get it, to really design and tune legged locomotion controllers that are very robust. We did this and depending how you count, but I would say about two, three years, primarily with two graduate students. Now, these are amazing graduate students, these are really extraordinary graduate students. But still, the fact that we could do this in two, three years speaks to the power of the approach. And the approach is essentially taking the system through a tremendous amount of experience in simulation, and have it do all the trying and falling in simulation. And then the key question after that is what happens when you learn in simulation and put the controller on the real robot, in reality, will it work? And there are a few ideas that make it work, and a few pleasant surprises where it worked better than we expected. One key idea that was introduced in our previous paper, the science robotics paper that we published a couple of years ago, is to empirically characterize the actuators that are used on the real robot. So you basically, the measure, and you do system identification, you measure the dynamics model of each actuator, empirically by just perturbing the robot, actuating the actuator and just seeing what happens, seeing how the system responds. And that means that you don't need to model complex motors with their delays and the electromechanical phenomena that happened in the actuators, you don't need to model that analytic. You can just fit a little neural network, little function approximator, to what you see. Then you take this empirical actuator model into your simulated legged system, then you have the legged system walk around on simulated terrain. That's where the pleasant surprise comes, which is that, we didn't have to model all the possible behaviors of simulated terrains and all the types of simulated terrains in simulation. We didn't have to model vegetation, we didn't have to model gravel, we didn't have to model crumbling, we didn't have to model snow and ice, just with a few simple types of terrains, and aggressively randomized geometry of these terrains, we could teach the controller to be incredibly robust. And the amazing thing that we discovered, which is maybe the most interesting outcome of this work is that in the real world, the controller was robust to things it never really explicitly saw in simulation. 
Snow, vegetation, running water, soft yielding compliant terrain, sand, things that would be excruciatingly hard to model. Turns out, we didn't need to model them at all. That's so cool. I guess we've talked to a whole bunch of people that work on different types of simulated data often just for the cost savings, right? Of being able to generate infinite amounts of data. And it seems like, if I could summarize what they seem to say it's that, you often benefit from still like a little bit of real world data in addition to the simulated data. But it sounds like in this case you didn't actually need it. Did it literally work like the first time you tried it or were there some tweaks that you had to make to the simulation to actually get it to bridge the gap between simulation and reality? It worked shockingly well. And what helped a lot is that Junho, just kept going. And I love working with young researchers, young engineers, young scientists, because they do things that would seem crazy to me. And if you ask me to predict, I would say that's not going to work. But fortunately, often, they don't ask me and they just try things. And so we would just watch Junho, try things out, and things kept working. So the fact that you don't need to model these very complex physical behaviors, in the terrain, in the environment. This is an empirical finding, we basically discovered this, because Junho tried it, and it worked. And then he kept doing it, and it kept working and it kept working remarkably well. So somehow, it was very good that he didn't ask me and others, is this a good idea? Should I try this? It seems like there's these obvious extensions that would be amazingly useful, like if you tried to do bipedal locomotion, and then making the robot, it's like usefully engaging with its world. Where does this line of inquiry get stuck? It seems so promising. We're definitely pushing this along a number of avenues. I'm very interested in bipeds. And we do have a project with bipeds. We're also continuing to work with quadrupeds, we have multiple projects with quadrupeds and we're far from done with quadrupeds. There's definitely more, there's more to go. And then you mentioned interaction, you mentioned engaging with the world. And this is also very interesting frontier and we have projects like this as well. So ultimately, you want not to just navigate through the world, you also want to interact with this more deliberately. Not just be robust and not fall and get to where you want to go. But after you get to where you want to go, you actually want to do something, maybe take something or somewhere else or manipulate the environment in some way. What physics simulator did you use? Is this something you built? Or did you use off to shelf? This is a custom physics simulator built by Jimin, who led the first stage of that project. That's why I said by the way that it took three years, because I'm including that previous iteration that was done by Jimin that laid a lot of the groundwork, and a lot of the systems infrastructure we ended up using. So Jimin basically built a physics simulator from scratch, to be incredibly, incredibly efficient. So it's very easy for the simulation times to get out of hand. And if you're not careful, you start looking at training times on the order of a week or more. And I've seen I've seen this happen when people just code in Python and take off the shelf components, they get hit with so much overhead and so much communication. 
And then I tell them that they can get one or two or three orders of magnitude if they do it themselves and sometimes it's really unnecessary. And so the debug, our debug cycle was a couple of hours in this project, so that helped. That's incredible. And that seems like such an undertaking to validate a physics simulator from scratch. Was it somehow constrained to make it a more tractable problem or? So I think what helped is that Jimin did not build a physics simulator for this project. It's not that he started this project, and then he said, ""I need to pause the research for about a year to build a custom high performance physics simulator, and then I'll get to do what I want to do."" He built it up during his PhD, during many prior publications, and it's a hobby project just like every self respecting computer graphics student has a custom rendering engine that they're maintaining. So in this area, a number of people have custom physics engines that they're maintaining just because they're frustrated with anything they get off the shelf, because it's not custom enough, it doesn't provide the interfaces they want, it doesn't provide the customizability that they want. One of the things you've mentioned in the paper or one of the papers was using privilege of learning as a learning strategy. Just something I hadn't heard of. Could you describe what that is? Yeah. It's an incredibly powerful approach that we've been using in multiple projects. And it splits the training process into two stages. In the first stage you train a sensory motor agent that has access to privileged information. That's usually the ground truth state of the agent, for example, where it is, exactly what its configuration is. So for example, for an autonomous car, it would be absolutely precise ground truth position in the world down to the millimeter. And also the ground truth configuration of the environment, that matters in the environment. The geometric layout of the environment, the positions of the other participants, the other agents in the environment and maybe even how they're moving and where they're going and why. So you you get this God's eye view into the world, the ground truth configuration of everything. And this is actually a much easier learning problem, you basically don't need to learn to perceive the world through incomplete and noisy sensors, you just need to learn to act. So the teacher, this first agent, we call it the teacher, the privilege teacher, it just learns to act. Then you get this agent, this teacher that always knows what to do, it always knows how to act very, very effectively. And then this teacher trains the student that has no access to privileged information. The student operates only on real sensors that you would have access to in the real world, noisy, incomplete sensors, maybe cameras, IMU, only onboard sensors, only onboard computation. But the student can always query the teacher and ask what would you do? What is the right thing to do? What would you do in this configuration? What would you do in this configuration? So the learning problem is, again, easier, because the student just needs to learn to perceive the environment. It essentially has a supervised learning problem now, because in any configuration it finds itself, the teacher can tell it, here is the right thing to do, here is the right thing to do. Okay? So the sensory motor learning problem is split into two. First, learning to act without perception being hard. 
And second, learning to perceive without action being hard. Turns out, that's much easier than just learning the two together in a bundle. That's really interesting. So the way you did the second part of the training, let me make sure I got this. This second model with the realistic inputs, is it trying to match what the teacher would have done? Yeah. But it doesn't actually try to figure out an intermediate true representation of the world. It's just kind of matching the teacher; does it somehow try to actually do that mapping from noisy sensors to real world state? Right. It doesn't need to reconstruct the real world state. So there are different architectures we can imagine with different intermediate representations. But the simplest instantiation of this approach is that you just have a network that maps sensory input to action, and then this network is just trained in a supervised fashion by the actions that the teacher produces. I see, cool.
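A minimal sketch of that two-stage idea, with made-up dimensions and a random stand-in for the simulator. In the real systems the teacher is trained with reinforcement learning first; here it is just a frozen network so that the supervised student stage is visible on its own:

```python
import torch
import torch.nn as nn

STATE_DIM, OBS_DIM, ACTION_DIM = 12, 32, 4  # hypothetical sizes

# Stage 1 product: a teacher policy that sees privileged ground-truth state.
teacher = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
teacher.requires_grad_(False)  # assume it was already trained; keep it frozen

# Stage 2: a student policy that only sees noisy onboard observations.
student = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, ACTION_DIM))
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def sample_batch(batch=256):
    # Stand-in for the simulator: privileged state plus a noisier,
    # higher-dimensional observation derived from it.
    state = torch.randn(batch, STATE_DIM)
    obs = torch.cat([state + 0.1 * torch.randn_like(state),
                     torch.randn(batch, OBS_DIM - STATE_DIM)], dim=1)
    return state, obs

for step in range(1000):
    state, obs = sample_batch()
    target = teacher(state)  # ask the teacher: what would you do here?
    loss = nn.functional.mse_loss(student(obs), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the split is visible in the loss: the student never needs a reward signal, only the teacher's actions, so the hard exploration problem was already solved in stage one.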
So I'm really just cherry-picking your papers that just seem kind of awesome to me. But I was also pretty impressed by your paper where you taught drones to do, like, crazy acrobatics. Do you know what I'm talking about? Yeah. Yeah. So you talk about the simulation in that one, and it seemed like it must be really hard to simulate what actually happens to a drone as it, like, kind of flies in crazy ways. I mean, I'm not sure, but it seems so stochastic to me, just like watching a drone. It's so hard to control a drone, I was actually wondering if that... It seems like it must have been a real simulation challenge to actually make that work. Also, we should put a link to these videos because they're super cool. Yeah, yeah. This was an amazing project, driven again by amazing students from the University of Zurich, Antonio Loquercio. First we benefited from some infrastructure that the quadrotor community has, which is they have good quadrotor simulators, they have good models for the dynamics of quadrotors. We also benefited from some luck, which is that not everything that can happen to a quadrotor needs to be simulated to get good quadrotor control. So for example, we did not simulate aerodynamic effects, which are very hard to simulate. So if a quadrotor goes close to a wall, it then gets aerodynamic pushback. It gets really, really hairy. But we did not simulate that and it turns out we didn't need to. Because the neural network makes decisions moment to moment, moment to moment. And if it gets a bit off track, if it's thrown around, no problem, in the very next moment, it adjusts to the state that it finds itself in. So this is closed loop control. If it was open loop control, well, it would have failed. I see. Interesting. Were there any other details that you had to get right to make that work? I mean, I'm really impressed the way you're... It seems like you're sort of effortlessly able to jump from simulation to reality. And everyone else that I talk to is like, this is like the most impossible step. But it's something about these domains or something you're doing seems to work really effectively for you? Yeah, yeah. So we're getting a hang of this. And there are a few key ideas that have served us well. One key idea is abstraction. So abstraction is really, really key. The more abstract the representation that a sensor or a sensory modality produces, the easier it is to transfer from simulation to reality. So what do you mean by abstract? Can you give me an example of abstract versus not abstract? Yeah. Let's look at three points on the abstraction spectrum. Point number one, a regular camera, like the camera that is pointing at you now and the camera that is pointing at me now, point number one. Point number two, a depth map coming out of a stereo camera. So we have a stereo camera, it's a real sensor, it really exists, it produces a depth map. Let's look at that depth map. Point number three, sparse feature tracks that a feature extractor like SIFT would produce. So just very salient points in the image, and just a few points that are being tracked through time, so you're getting just a few moving dots. So the depth map is more abstract than the color image. Why is that? Because there are degrees of variability that would affect the color image that the depth map is invariant to. The color of that rack behind you would massively affect the color image, but would not affect the depth map. Is it sunny? Is it dark? Are you now at night with your environment lit by lamps? All of that affects the color image and it's brutally hard to simulate. And it's brutally hard to nail the appearance so that the simulated appearance matches the statistics of the real appearance. Because we're just not very good at modeling the reflectance of real objects. We're not good at dealing with translucency, refraction, we're still not so great at simulating light transport. So all these things that determine the appearance of the color image, very, very hard to simulate. The depth map is invariant to all of that, it gives you primarily a reading of the geometric layout of the environment. So, if you have a policy that operates on depth maps, it will transfer much more easily from simulation to reality, because things that we are not good at simulating, like the actual appearance of objects, they don't affect the depth map. And then if you take something even more abstract, let's say you run a feature extractor, a sparse feature tracker, through time, the video will just be a collection of dots, like a moving dot, a moving point display. It actually still gives you a lot of information about the content of the environment. But now it's invariant to much more, it's invariant also to geometric details and quite a lot of the content of the environment. So maybe you don't even have to get the geometry of the environment or the detailed content of the environment right, either. So now that's even more abstract. And that last representation is the representation that we used in the deep drone acrobatics project. So the drone, even though it has a camera, and it could look at the color image, it deliberately doesn't. It deliberately abstracts away all the appearance and the geometric detail and just operates on sparse feature tracks. And it turns out that we could train that policy with that sensory input in very simple simulated environments, and it would just work out of the box in the real world.
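A rough sketch of what a sparse-feature-track input looks like in code, using OpenCV corner detection plus Lucas-Kanade optical flow. The video path is hypothetical, and this is only meant to show the kind of dot-level representation being described, not the actual pipeline from the paper:

```python
import cv2

cap = cv2.VideoCapture('onboard_video.mp4')   # hypothetical camera stream
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Pick a handful of salient points to follow; this is the sparse part.
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                              qualityLevel=0.01, minDistance=10)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Track the same points into the new frame with Lucas-Kanade flow.
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    pts = new_pts[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
    # pts is now just a small set of moving dots; a policy operating on a
    # representation like this never sees appearance or lighting at all.
```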
Well, it's so interesting, it makes me wonder, I mean, people that we've talked to have talked about sort of end-to-end learning with, like, autonomous vehicles versus pieces. And I guess I'd never considered that if you kind of break it up more, or have, like, more intermediate representations, it might make transferring from simulation to the real world easier. But that actually makes total sense. Yeah. So I think, for example, the output of a lidar is easier to simulate than the original environment that gave rise to that output. So if you look at the output of a lidar, it's a pretty sparse point set. If you train a policy that operates on the sparse point set, maybe you don't need a very detailed, super high fidelity model of the environment, certainly maybe not of its appearance, because you don't really see that appearance reflected much in the lidar reading. Interesting. I guess I also wanted to ask you about another piece of work that you did that was intriguing. Which is this Sample Factory paper, where you have kind of a setup to train things much faster. And I have to confess I kind of struggled to understand what you were doing. So I would love just kind of a high level explanation. I mean, like, maybe, I'm not a reinforcement learning expert at all. So maybe kind of like set up what the problem is, and kind of what your contribution is that made these things run so much faster. Yeah. So, our goal is to see how far we can push the throughput of sensory motor learning systems in simulation. And we're particularly interested in sensory motor learning in immersive three dimensional environments. I'm personally a bit less jazzed by environments such as board games, or even Atari, because it's still quite far from the real world. Although you have done a fair amount of work on it, haven't you? Right. So we've done some, but what really excites me deeply is training systems that work in immersive 3D environments, because that, to me, is the big prize. If we do that really, really well, that brings us closer to deploying systems in the physical world. The physical world is three dimensional, the physical world is immersive, perceived from a first person view, with onboard sensing and computation, by animals, including humans. And these are the kinds of systems that I would love to be able to create. So that's where we tried to go in our simulated environments. And these simulated environments tend to be, if you're not careful, pretty computationally intensive. And if you just use, again, if you use out-of-the-box systems, you will notice a pattern here. If you just use tools out of the box, and have some high level Python scripting on top of existing tools, you'll basically have a simulation environment that runs at 30 frames per second, maybe 60 frames per second. You're roughly collecting experience at something that corresponds to real time. Now, as we mentioned, it takes a human toddler a couple of years of experience to learn to walk. And a human toddler is a much better learner, a much more effective learner, than any system we have right now. So two years is a bit slow if you ask me for a debug cycle, I don't want to have a debug cycle of two years. And in fact, what we need to do is take this amount of experience, and then multiply it by several orders of magnitude, because the models that we're training are much more data hungry, and they're much poorer learners than the human toddler. So then basically, we're looking at compressing maybe centuries of experience, until we get better at the learning algorithms and the models we design. But with the current models and algorithms, the challenge is to compress perhaps centuries of experience into overnight, an overnight training run, which is a reasonably comfortable debug cycle. You launch a run, you go home, you come back in the morning, you have experimental results.
That basically means that you need to operate, you need to collect experience and use it for learning, on the order of hundreds of thousands of frames per second, millions of frames per second. And this is where we're driving. So in this paper, we demonstrate a system architecture that, in an immersive environment, trains agents that act, collect experience, and learn in these 3D immersive environments on the order of 100,000 frames per second on a single machine, a single server. And the key was basically a bottom-up, from-scratch, from-first-principles system design with a lot of specialization. So we have processes that just collect experience, agents just run nonstop and collect experience. We have other processes that just learn and update the neural network weights. So it's not that you have an agent that goes out, collects experience, then does some gradient descent steps, updates its weights, goes back into the environment, collects some more experience with better weights, and so on and so forth. Everything happens in parallel, everybody is busy all the time. And the resources are utilized very, very close to 100% utilization. Everything is connected through high bandwidth memory, everything is on the same node, so there is no message passing. Because if you look at these rates of operation, if you're operating at hundreds of thousands of frames per second, message passing is too slow. The fastest message passing protocol you can find is too slow, the message passing becomes the bottleneck in the system. So what happens is that these processes just read and write from shared memory. They just all access the same memory buffers. When the new neural network weights are ready, they're written into the memory buffer; when a new agent is ready to go out and collect experience, it just reads the latest weights from the memory buffer. And there is a cute idea that we borrowed from computer graphics, which is double buffering. And double buffering is one of the very, very first things I learned in computer graphics as a teenager, when we wrote the assembly code. And basically lesson one in computer graphics is, how do you even display the image? Double buffering is part of lesson one. The idea is that there are two buffers. The display points to the front buffer and that's what's being displayed, that's the active buffer. In the meantime, the logic of your code is updating the back buffer with the image of the next frame. When the back buffer is ready, you just swap pointers. So the display starts pointing to the back buffer, that becomes the primary one. And then the logic of your code is operating on what used to be the front buffer. So the back buffer becomes the front buffer, the front buffer becomes the back buffer, and you keep going. We introduced this idea into reinforcement learning, again, to just keep everybody busy all the time. So the learning processes work on a buffer and then write out the new weights, and the experience collectors have their own buffer that they're writing out sensory data into. And then they swap buffers, there's no delay, and they just keep going. Interesting.
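The double-buffering idea is simple enough to sketch in a few lines. This toy version uses a single process and plain arrays, whereas the real system keeps the two buffers in shared memory so learner and actor processes never wait on each other; the names and sizes here are made up:

```python
import numpy as np

N_PARAMS = 1_000_000
buffers = [np.zeros(N_PARAMS, dtype=np.float32),
           np.zeros(N_PARAMS, dtype=np.float32)]
front = 0  # index of the buffer that actors currently read from

def publish_weights(new_weights):
    # The learner writes into the back buffer, then a single index swap
    # makes the fresh weights visible; actors never see a half-written buffer.
    global front
    back = 1 - front
    buffers[back][:] = new_weights
    front = back

def read_weights():
    # Actors grab whatever is currently in front, with no locking on the hot path.
    return buffers[front]
```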
Would it be possible to scale this up if there were multiple machines and there was a delay in the message passing? So the distributed setting is more complex, we have avoided it so far. If you are connected over a high speed fabric, then it should be possible. We've deliberately maybe handicapped ourselves, still, even in a follow-up project that we have now that was accepted to ICLR. We limited ourselves to a single node, because we felt that we would learn useful things if we just constrain ourselves to a single node and ask how far we can push single-node performance. And in this latest paper that was just accepted to ICLR, we basically showed that with a single node, if we again take this holistic, end-to-end, from-first-principles system design philosophy, we can match results that previously were obtained on an absolutely massive industrial scale cluster. Yeah, I mean, your learning speed is so fast to me, it seems faster than actually what I would expect from, like, supervised learning, where you're literally just pulling the images off your hard drive. Am I wrong about that, or? Oh, yeah. So in the latest work, it's basically the forward pass through the ConvNet, and that is one of the big bottlenecks. It's no longer the simulation; we can simulate so fast, we can simulate the environment so fast, it's no longer the bottleneck. It's actually, like, routine processing, like even just doing the forward pass in the ConvNet. Amazing. So I guess, like, one more project that you worked on that I was kind of captivated by, and I kind of want to ask about because I think a lot of people that watch these interviews would be interested in it too, is CARLA, right, which is, like, kind of an environment for learning autonomous vehicle stuff. Can you maybe describe it? And what inspired you to make it? Yeah, CARLA is a simulator for autonomous driving. And it's grown into an extensive open source simulation platform for autonomous driving that's now widely used, both in industry and in research. And I can answer your question about inspiration, I think, in two parts. There is what originally inspired us to create CARLA and then there is what keeps it going. And so what originally inspired us is actually basic scientific interest in sensory motor learning and sensory motor control. I wanted to understand how we train intelligent agents that have this kind of embodied intelligence that you see in us and other animals. Where we can walk through an environment gracefully, deliberately, we can get to where we want to go, we can engage with the environment, if we need to rearrange it, we rearrange it. We clearly act spatially intelligently, and intelligently in an embodied fashion. And this seems very core to me and I want to understand it, because I think this underlies other kinds of intelligence as well. And I think it's important for us on our way to AI, to use the loaded term; I think it's very important for us to understand this aspect of intelligence. It seems very core to me, the kinds of internal representations that we maintain, and how we maintain them, as we move through immersive three dimensional environments. So I wanted to study this, I wanted to study this in a reproducible fashion. I wanted good tooling, I wanted good environments in which this can be studied. And we looked around, and when we started this work, there just weren't very good, very satisfactory environments for us. We ended up in some early projects, we ended up using the game Doom, which is a first person shooter that I used to play as a teenager, and I still have a warm spot for. And we used Doom and we used it to good effect and, in fact, we still use it in projects. And we used it in the Sample Factory paper as well. I mean, Sample Factory is another paper that is based on Doom, essentially on derivatives of John Carmack's old code, which tells you something about the guy, right?
So if people still use your code, 25 years later, you did something good. You did something right. But Doom if you just look at it, it's somehow is less than ideal, right? I mean, you walk around and in a dungeon and you engage in assertive diplomacy of the kind that maybe we don't want to always look at and we don't want our graduate students to always be confronted with. I mean, there's a lot of blood and gore and somehow wasn't designed for AI, it was designed for the entertainment of, I guess, primarily teenage boys. So we wanted something a bit more modern, and that connects more directly to the kinds of applications that we have in mind to useful productive behaviors that we want our intelligent systems to learn. And autonomous driving was clearly one such behavior. And I held the view at the time that I still hold that autonomous driving is a long term problem, it's a long term game. It wasn't about to be solved, as people were saying when we were creating CARLA, and I still don't think that it's about to be solved. I think it's a long term effort. So we created a simulation platform where the task is autonomous driving. And as an embodied artificial intelligence task, as an embodied artificial intelligence domain, I think it's a great domain. You have a complex environment, you need to navigate through it, you need to perceive the environment, to make decisions in real time, the decisions really matter if you've got something wrong, it's really bad. So the stakes are high, but you're in simulation. So that was the original motivation, it was basic scientific interest in intelligence and how to develop intelligence. And then the platform became very widely used. People wanted it, people wanted it for the engineering task of autonomous driving, and people kept asking for more and more and more and more features, more and more functionality, other large institutions like actual automotive companies started providing funding for this platform to be maintained and developed because they wanted it. And we put together a team that the team has ably led by German Ros, one of the original developers of CARLA, who is now leading an extensive international team that is really primarily devoted to the autonomous driving domain and supporting the autonomous driving domain through CARLA. That's so cool. I feel like maybe one criticism of academia, I don't know if it's fair or not, is that, it has trouble with incentives to make tools like this that are really reusable. Did you feel pressure to write papers, instead of building a robust simulating tool, that would be useful for lots of other people? Well, I maintain a portfolio approach where I think it's okay for one thrust of my research and one thrust of my lab to not yield the publication for a long time. Because other thrusts just very naturally end up publishing more. So it balances out, it balances out. I personally don't see publication as a product or as a goal. I see publication as a symptom, publication is a symptom of having something to say. So publications come out, they come out at a healthy rage, just because we end up discovering that useful things that we want to share with people. And I personally find it very gratifying to work on a project for a long time, and do something substantial, maybe than published. And if people use our work, and it's useful to them, that is its own reward to me. So even if there is no publication, if people find our work useful, I love it, I find it very, very gratifying. Mm-hmm (affirmative). 
Yeah, I can totally relate to that. Can I ask you a more open ended question since you're kind of getting to the end of this? I guess, I wonder when I look at ML applications, I guess, broadly defined ML. The one that is kind of mysterious to me is robotics, right? Like, I feel like I see ML like working all over the place. It's just so easy to find... Like suddenly, my camera can search semantically. But then, I feel like the thing that I can do, that computers most can't do is kind of pick up an arbitrary object and move it somewhere. And it seems like you've been really successful, getting these things to work to some degree. But I guess I always wonder like, what is so hard about robotics? And is this... Do you think there'll be like a moment where something starts working and we see ML robot applications all over the place? Or is this always going to remain like a huge challenge? I don't think it will always remain a huge challenge. I don't think there is magic here. The problem is qualitatively different from your perception problems, such as computer vision, and being able to tell your camera, where is Lukas? And the camera will find Lukas. The problem is qualitatively different, but I don't think the problem is insurmountable. And I think we're making good progress. So the challenge is that to learn to act, you need to actually act. To act, you need to act in an environment, you need to act in a living environment. If you act in a physical environment, you have a problem, because the physical environment runs in real time. So you're potentially looking at the kinds of debug cycles that we mentioned with a human toddler, or something takes a couple of years to learn. And in these couple of years, I mean, that toddler is also an incredibly robust system, right? The toddler can fall no problem, right? So during this time you run out of battery power, you fall, you break things, you need a physical space in which all of this happens. And then if you're designing the outer learning algorithms, you need to do this in parallel on many, many, many, many variations. You need many, many, many, many slightly different toddlers to see which one learns better. And it's very, very hard to make progress in this regime. So I think we need to identify the essential skills, the underlying skills, that... And I think many of these can be understood and modeled in essentially [inaudible 00:44:27] model systems. So if you look at neuroscience, for example, much of what we know about the nervous system was not discovered in humans, in the human nervous systems. It was discovered in model systems such as squids. So a squid is pretty different from a human. But it shares some essential aspects when it comes to the operation of the nervous system. And it's easier to work with, for very many reasons. Squids are just easier to work with than than humans. Nobody says that if we understand squid intelligence, we will understand everything about human intelligence, and how to write novels and compose music. But we will understand many essential things that advance the field forward. I believe, we can also understand the essence of embodied intelligence, without worrying about, let's say, how to grass with slippery pebbles, and how to pour coffee from a particular type of container. 
Maybe we don't need to simulate all these complexities of the physical world, we need to identify the essential features that really bring out the essence of the problem, the essential aspects of spatial intelligence, and then study these inconvenient model systems. That's what we try to do with a lot of our work. And I think we can actually make progress, make progress enough to bootstrap physical systems that are basically intelligent enough to survive, and not cause a lot of damage when they're deployed in the physical world. And then we can actually deploy them in the physical world and start tackling some of these last millimeter problems such as how to grasp a slippery glass, that kind of thing. That's so interesting, it really last millimeter because I feel like something just like, I mean, you would know better than me, but just like the way fabric hangs, or the way, like liquid spill, I understand that those are incredibly hard to simulate with any kind of accuracy as we would recognize it. You think that that's like, actually in the details, and the more important thing is like, well what is the more important thing then? To know how to simulate quickly? Or where's the productive access to improve? Well, one problem that I think a lot about that seems pretty key is the problem of internal representations of spatial environments that you need to maintain. So suppose you want to find your keys, okay, you're in an apartment, you don't remember where you left your keys, you want to find your keys. Okay? So you need to move through the apartment, and you need to maintain some representation of it. Or you're in a new restaurant, and you want to find the bathroom. You've never been there before, you want to find the bathroom. I've done this experiment, many times, you always find the bathroom and you don't even need to ask people, right? How do you do that? What is that? So I think these questions these behaviors step into actually an important, what to me feels like an essential aspect of embodied intelligence, an essential aspect of spatial intelligence. And I think if we figure that out, we will be on our way, we will not be done, but we will be on our way. Then there is the very detailed aspects, one of my favorite challenges, long term challenges for robotics as Steve Wozniak's challenge. Which is that a robot needs to be able to go into a new house that it's never been in before and make a coffee. So that I think will not be solved with just the skill that I mentioned to you. That does rely on some of these last millimeter problem of sort of the detailed actuation, also reasoning about the functionality of projects of objects. And I think we're actually far I don't think it's going to happen next year. I think we're quite far, but it's a very exciting journey. Awesome. I love it. Thanks so much for your time. That was a lot of fun. Thank you so much Lukas. Thanks for listening to another episode of Gradient Dissent. Doing these interviews are a lot of fun. And it's especially fun for me when I can actually hear from the people that are listening to the episodes. So, if you wouldn't mind leaving a comment and telling me what you think or starting a conversation that would make me inspired to do more of these episodes. 
And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",7482 +Dominik Moritz — Building Intuitive Data Visualization Tools,https://www.youtube.com/watch?v=bCTtibplEg8,2344,2021-03-25,"When we designed Vega-Lite, we build it not just as a language that can be authored by people but, actually, as a language where we can automatically generate visualizations. And I think that's also what distinguishes it from other languages, such as D3 or ggplot2 in R. Because we're in JSON, it is very easy to programmatically generate visualizations. You're listening to Gradient Dissent. Today, we have Lavanya with us who has been watching all the interviews in the background but we wanted to get her in there asking questions. And we're talking to Dominik who's one of the authors of Vega-Lite and we got excited to talk to him because we've been using Vega in our product and we recently released it, but it solves this huge problem for us where we want to let our users have complete control over the graphs in a language that makes sense. And then we discovered Vega, and it was a perfect solution to the problem that we had. And then we talked to Dominik, and he had so many interesting ideas about the way machine learning should be visualized and we didn't even realize he came from a visualization background. So, we have a ton of questions to ask him today. Super excited. I can't wait. I think the main thing or... You've done a bunch of impressive stuff, but the thing that is most exciting for us is that you were one of the authors of Vega-Lite. And so, I, kind of, thought maybe the best place to start for some people who don't know even what Vega is, is just sort of describe what Vega is and what the goals are and then how Vega-Lite works within that context. Yeah. So, the way Vega came to be is that my advisor, Jeff Heer... So, Jeff together with his graduate students, Arvind and Ham, created a declarative way to describe interactions. Building on ideas from functional reactive programming, which is a concept that's been around for quite a while. So, they've adopted this concept for visualizations to describe not just the visual encodings, but also the interactions, fully-declaratively. And so, that then became... I think that was a Vega version 2.0 at that point. Vega, at that point, was still fairly low-level, and that you had to describe all the details of the visual encoding as well as the axes and legends and potential other configurations. So, around at the same time, my colleague Ham, who also worked on the first version of Vega... on this reactive version of Vega, he was working on a visualization recommendation browser, at that point it was called Voyager and I helped him with it and we needed a visualization language to do recommendation in. And so, Ham and Jeff talked about the need for high-level visualization they were taking. Do a recommendation where you don't have to specify all the details, but really only what's essential, which is this mapping from data to visual properties. And so, I think they talked to the Vis Conference in Paris, and on the flight back, Jeff hacked the first version of it, which then I think the code is still what we were building on today. That's awesome. Sorry. Before you go too far down this path, I'm going to ask all the dumb questions that I feel embarrassed to ask. I mean, I feel like I've heard declarative language for visualization many times, and I always, kind of, nod but what does declarative really mean? 
What would be the non-declarative way to describe a visualization? Yeah, the biggest distinction between declarative and, on the other side, imperative is that in a declarative language, you describe what you want, not how you want an algorithm to execute steps to get to where you want to go. Good examples of that are HTML and CSS, where you describe what the layout of the page should be, but you don't tell the layout engine to move something by a couple of pixels and then move again by a couple of pixels. Another good example of a declarative language is SQL (or SEQUEL), which is the database query language that people use to query databases, both for analytics or for, let's say, a banking system, for instance. And in these declarative queries you describe what you want the result to be. So you say, ""I want from this table the tuples, or the rows, that have these properties."" And you don't describe how you're going to get that. And that's as opposed to an imperative algorithm, where you would have to write the search yourself, and you would need to know how the data is stored, in what format, whether it's maybe even distributed on multiple machines or not. In a declarative language, you only describe what you want. And then, that could run on a small database that's embedded, or it could run on a cluster of a thousand machines, and you shouldn't have to worry. And so, for visualization, that means you shouldn't have to worry about how the visualization is drawn, how you draw a pixel here, a rectangle here, a line there. No, you just want to say, ""Make a chart that encodes these variables."" So, I guess, how declarative is it? And I have used Vega a fair amount, but I think people that are listening or watching may not have, right? So, I suppose the most declarative thing might be, sort of, give me an insight about these variables or just compare these variables, right? But that might be unsatisfying. At what level are we describing the semantics of what we're doing versus saying, ""Hey, give me these three pixels here.""? Do you say exactly the type of plot that you want or is that inferred? How do you think about all of that? Yeah. We built on this concept called the grammar of graphics. And that is a really cool concept that a lot of languages, even D3, have built on. And the core idea is that a visualization is not just a particular type, so it's not just a horizontal bar chart, or a bubble chart, or a radar plot. Instead, a visualization is described as a combination of basic building blocks, kind of like in language, where we have words that we combine using rules, which is grammar. And so, the words in the grammar of graphics are two things. One is marks and the other one is visual encodings. A mark is, for instance, a bar or a line or a point, and an encoding is a mapping from data properties to visual properties of that mark. So, for instance, a bar chart is a bar mark that maps some category to X and some continuous variable to Y. And that's how you describe a bar chart. And now, I think what's cool about this is, if you want to change from a horizontal to a vertical bar chart, or from a column chart to a bar chart, you don't have to change the type. You just swap the channels in the encoding. I have a question. So, we see so many really messed up charts that people make, because people get too excited, especially when they work with a really powerful visualization tool. 
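As an aside, here is a minimal sketch of the declarative bar chart described just above, written with Altair, the Python API for Vega-Lite that comes up later in the conversation. The DataFrame and column names are invented for illustration, not taken from the episode.

```python
# A bar chart in the grammar-of-graphics sense: a bar mark plus two encodings.
import altair as alt
import pandas as pd

df = pd.DataFrame({'category': ['a', 'b', 'c'], 'value': [4, 7, 2]})

# Vertical bar chart: category mapped to x, continuous value mapped to y.
vertical = alt.Chart(df).mark_bar().encode(x='category:N', y='value:Q')

# Horizontal version: same mark, just swap the channels in the encoding.
horizontal = alt.Chart(df).mark_bar().encode(x='value:Q', y='category:N')

# Because Vega-Lite specs are JSON, the chart can be emitted programmatically.
print(vertical.to_json())
```

Swapping the x and y channels is all it takes to flip the orientation, which is the point about not having to change the chart type.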
And I feel like you've spent so much of your life designing really good grammars for visualizations and designing a lot of really cool plots. So, what's your recommendation for people... for best practices, for designing these visualizations? I think it is actually making mistakes. It is trying it out and seeing how difficult or how easy it is to read data in a particular chart. But before you actually go out and publish a chart and show it to the world, maybe think about, ""What can I remove from this chart?"" I think a visualization is really showing what you want to show when it's showing the essence of the data. Very important in any visualization design is following two basic principles, and these are often called effectiveness and expressiveness. This goes back to some work from Jock D. Mackinlay, who developed, actually, an automated system to follow these rules. So, these two rules, they're, kind of, oddly named, but essentially what they boil down to is, first, expressiveness. It means that a visualization should show all the facts in the data, but not more than that. So, what that also means is that a visualization shouldn't imply something about the data that doesn't exist in the data. And then effectiveness means to make a visualization that is as easily perceivable as possible, and one rule that you can apply there is to use the most effective channels first. And the most effective channels are X and Y, or... They're like lengths and positions. They're the best. And then afterward, it's like color and size and some other things. So, that's why bar charts and line charts are so popular and so effective: because they are using those very effective channels first. But this also... Sometimes you have to go beyond effectiveness. Yeah. I always wonder... Is there any room for fun or novelty in a good visualization? Yeah, that's a good question. I like to, actually, think back to a paper from Tukey and Wilk. They wrote, in the sixties, one of the famous papers about exploratory analysis and statistics. And they talked about the relationship of statistics to visualizations. The paper is full of amazing quotes and it's, kind of, amazing to read this today because almost everything is still true today. But one of the things they say there is that it's not necessarily important to invent new visualizations, but to think about how we can take the visualizations that we have, or the essence of the visualizations, and combine them in new ways to fit new opportunities. And so, I think there is a lot of creativity in making visualizations, even the simple ones, pie charts, line charts, scatter plots, but combining them in meaningful ways. Also, pre-transforming the data in meaningful ways. And so, there can be a lot of creativity in there. Yeah. Do you have a favorite visualization that you think is maybe underused or that you'd like to see more of? I think slope charts are kind of amazing. What's a slope chart? What's a slope chart? So, naming charts, by the way, is an interesting concept. If you think about the grammar, the concept of naming charts is kind of odd. I'm going to reveal a secret, but it's something I want to write: a system that automatically names a chart, or, the other way around, you give it a name and it tells you what the specification is. Okay. But going back to the slope chart, the slope chart is, imagine you have two categorical values, let's say two years, and you have data for those years. 
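To illustrate the effectiveness principle discussed a moment earlier, here is a small hypothetical comparison of the same measure encoded with a position channel versus a color channel; the data is invented for illustration.

```python
# The same field encoded two ways: position (easy to compare) versus color (harder).
import altair as alt
import pandas as pd

df = pd.DataFrame({'category': list('abcde'), 'value': [3, 9, 5, 7, 1]})

# Position encoding: differences in value can be read off the bar heights.
by_position = alt.Chart(df).mark_bar().encode(x='category:N', y='value:Q')

# Color encoding only: every mark is the same size, value is carried by hue/intensity.
by_color = alt.Chart(df).mark_square(size=400).encode(x='category:N', color='value:Q')

# Place the two charts side by side to compare how easily the values are perceived.
comparison = by_position | by_color
```

In most cases the position version is much easier to read precisely, which is why the most effective channels are recommended first.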
And now what you could do is plot that as a scatter plot. So, on X you have the years and on Y you have some numerical measure. You could then draw the different categories that exist in both years as colored points. It's hard to see, actually, trends between those things, between those different years. But if, instead, you just draw a line between them, trends, or changes, just jump out at you. And that, I think, is great. So, wherever you have categorical data and this kind of bipartite graph, just drawing a line instead of drawing points there is great. It's called a slope chart? That's one name, the one in the Vega-Lite gallery. Oh yeah. We'll have to link to that. So, I guess, how do you think about the line between Vega-Lite and Vega? Is it always super clear what belongs where? Because I would think of declarative, I mean, they're both, in a sense, right? A declarative language for charts, right? One, sort of, just higher level and one's, like, a lower level. So, where do you draw the line? So, maybe before we go there, one important thing to keep in mind is that Vega and Vega-Lite added something to the grammar of graphics. Vega-Lite in particular added, for instance, support for interactions. So, something that my colleagues Ham, Arvind, and I worked together on, where we added some other kinds of words, or language constructs, that you can add to make charts interactive, and we also added composition. And so, these are high-level concepts, which then actually compile from Vega-Lite to the low-level Vega into, in this case, layouts and signals, which are these functional, reactive concepts that Vega has. And so, I think that helps me also, a little bit, understand the difference of what goes where. And what is, sorry, composition? Before we drop that... Composition is being able to layer charts or concatenate charts. And we also have a concept called repeat, which is a convenient concatenation, and then faceting. Faceting, another word for it is trellis, is a way to break down a chart by a categorical variable. So, for instance, if you have data for different countries, you can then draw one histogram for each country or one scatter plot for each country. Faceted charts are also great. Often, faceting is a very powerful way to show your data if you have an additional categorical variable. So, is this where you make, sorry, a whole array or a matrix of charts? That's what I'm picturing for a faceted chart, a grid of charts. I see. Okay. Cool. Yeah. That's faceting. Okay. So, you asked what's composition, and then we talked about, oh, Vega, Vega-Lite. I think the biggest difference really between Vega and Vega-Lite is the abstraction level. Vega-Lite compiles to Vega. So, anything that's possible in Vega-Lite is also possible in Vega because of that. But it requires about one or two orders of magnitude more code in most cases. So, that's one big difference. And how do we achieve that? Well, one, we have higher-level mark types in Vega-Lite. So, for instance, Vega only has a rectangle, and some other low-level marks, but Vega-Lite actually has bars as a concept. And so, if you have that, you can have some defaults associated with that high-level mark type, which you would otherwise have to manually specify in Vega. In Vega-Lite, you don't have to specify them because they get instantiated and picked automatically. And then the other is sensible defaults, or smart defaults. Essentially, you don't have to specify an axis. We'll make one for you if you use the X encoding. 
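Here is a hypothetical sketch of the slope chart and the faceting described above, again using Altair; the years, countries, and values are placeholders.

```python
# Slope chart: two ordinal years on x, a numeric measure on y, one line per category.
import altair as alt
import pandas as pd

df = pd.DataFrame({
    'year': ['2019', '2020', '2019', '2020'],
    'country': ['A', 'A', 'B', 'B'],
    'value': [10, 14, 12, 9],
})

slope = alt.Chart(df).mark_line(point=True).encode(
    x='year:O',         # ordinal axis with the two years
    y='value:Q',        # the numeric measure
    color='country:N',  # one line per category, so changes jump out
)

# Faceting (trellis): break the same data down into one panel per country instead.
faceted = alt.Chart(df).mark_line(point=True).encode(
    x='year:O', y='value:Q'
).facet(column='country:N')
```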
If you use color, we'll make a legend for you. If you choose size, we'll make a legend for you. If you use faceting, we'll make a header for you, which is just, kind of, an axis. In Vega, you have to specify all the details of those marks or those elements. You can still override the defaults in Vega-Lite, but by default, we'll do something. And that's really what Vega-Lite is: it's a high-level language and a compiler that compiles from a high-level specification to a low-level Vega specification. Right now, we don't have a way to easily extend the high-level concepts we have in Vega, I'm sorry, Vega-Lite. We do have a little bit of an extension mechanism where you can add mark macros. So, for instance, box plots in Vega-Lite are just a macro, which actually compiles to a rectangle, a line, and the little ticks at the end. And there are a bunch of other things that are just macros. And so, one could actually build a language on top of Vega-Lite. And people have done that. Altair, for instance, is a Python wrapper, or Python syntax, a Python API for generating Vega-Lite JSON specifications. And there are other ones in Elm and R, and then somebody made one in Rust, and there's one in JavaScript. Oh, and Julia, there's one in Julia as well. That's a really good one. I guess the R comment made me wonder if you have any comments on ggplot2. I feel like that's often a beloved plotting library. Was that an inspiration for Vega at all? Or did you have reactions to it? So, ggplot2 came out a long time before Vega and Vega-Lite, and it also builds on the grammar of graphics. At the time, it really was the prime example of an implementation of the grammar of graphics in any programming language. It uses slightly different terminology from Vega and Vega-Lite. ggplot2 has definitely been a great inspiration. And we, what do I mean when I say we? So, Ham, Arvind, Jeff, and I have talked to Hadley Wickham before. Yeah. Big fans of it. We actually considered using it for Voyager, but because Voyager was built as a web application, interfacing from a web application to R would have been a lot more overhead than building on our own visualization language. Totally. Maybe switching gears a little bit. One thing I thought was interesting about your background and interests is that it's also machine learning. And I thought that was pretty interesting and cool. I wonder if machine learning has informed your thoughts about... Well, first, if it's informed your thoughts about visualizations at all, and then I'd love to hear if you have suggestions of kinds of visualizations that you think are helpful in the machine learning process. Yeah. I think visualization and machine learning are really good fits for each other. And so, I can think of two things that we can talk about: both where visualization is useful for machine learning and where machine learning is useful for visualization. Maybe let's start with why visualization for machine learning. I think one of the most important things in machine learning, and you can disagree with me there if you want to, is data, if it's not the most important thing. I think few people would disagree. Okay. So, because data is so... Okay, we can agree that data is essential to machine learning. If you have bad data, your model is not going to do anything good. You can still create a bad model with good data, but good data is essential for a good model. And so, understanding the data that becomes part of your model, or gets used to train the model, is really essential. 
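To make the mark-macro and smart-defaults points from earlier in this exchange concrete, here is a small hedged example: mark_boxplot is a high-level mark that Vega-Lite expands into lower-level marks, and the generated specification can be inspected directly. The data and the axis title are illustrative.

```python
# Box plots as a macro, plus overriding one of the automatically chosen defaults.
import altair as alt
import pandas as pd

df = pd.DataFrame({'group': ['a'] * 5 + ['b'] * 5,
                   'value': [1, 2, 2, 3, 9, 4, 5, 5, 6, 7]})

# Axes, scales, and the box plot layers are all picked automatically.
box = alt.Chart(df).mark_boxplot().encode(x='group:N', y='value:Q')

# Overriding a default: give the y axis an explicit title instead of the field name.
box_titled = alt.Chart(df).mark_boxplot().encode(
    x='group:N',
    y=alt.Y('value:Q', title='measured value'),
)

print(box.to_dict())  # the compiled Vega-Lite JSON specification as a plain dict
```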
And I think visualization is a really powerful way there to understand what's the new data that's happening there. Especially, in conjunction with more formal statistics, but foremost statistics are only good when you know what you're really looking for. When you're still trying to look around, what's in the state of what might be problems with the data, that's when visualization really shines. And you actually built a library to help with the exploration of data, right? Yeah. So, Voyager and then Voyager 2, and some other follow-up work from there was, or, is a visualization recommendation browser. So, the idea there is that rather than having to manually create all the visualizations and still go through this process of deciding which encodings do I want to use and which Mark type I want to use, just let you browse recommendations and still be able to steer the recommendations. So, the recommendations shouldn't go too far from where you are. They should still be close to what you've looked at before, but they should take away some of the tedium of having to manually specify all the charts. And the recommendation is great for two things. One is yeah because it makes visualizations less tedious, and also, it can encourage best practices, for instance, good statistical practice. Or good practice, data analysis practice, is to look at the [inaudible 00:19:38] summaries when you start looking at a dataset. So, what are the distributions of each of my fields, each of my dimensions? And doing that, before looking into correlations between dimensions. And this is often difficult if you start looking at one field and you're like, ""Oh, there's something interesting here. Now, I wonder how this correlates with these other bits."" And then you're off on a tangent. And so, by forcing you, or by offering you a gallery of all the dimensions, and all the [inaudible 00:20:13] summaries at first, it makes it a lot easier to follow that best practice of looking at all the [inaudible 00:20:18] summaries first. Can you do this at scale? Let's scale it to millions of rows and how do you even begin if your data set is that big to find patterns in it? And how does the software scale too? Yeah. So, the software is built as a piece of research prototype that is built as a browser application where all the data has to fit into the browser. So, it currently does not scale. But the interesting thing about is that the number of rows shouldn't really matter too much, as long as we can visualize it. We could probably have a whole episode about that. Wait, the number of us shouldn't matter, in what sense? It seems like it would make it more complicated to visualize. I mean, it doesn't make the visualization, necessarily, itself harder, but it seems, actually, scanning through all of them might start to get impractical. Yeah. I guess most... There are two issues, one is a computational issue of just transforming that data and then rendering it. And then the other is, ""Can I represent the data in a way that is not overwhelming to the viewer?"" But assuming that we can do that for a couple of thousands of data points, or tens of thousands, or hundreds of thousands of data points, if you have many dimensions, the recommendation aspect gets a lot more difficult because now you have to think about, ""Okay, how do I represent all these dimensions? Let users browse them. How do I show correlations between dimensions?"" There's a lot more of... Correlations between three dimensions, get impractical very quickly. Yeah. 
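Here is a rough sketch of the practice described earlier in this answer, looking at the univariate summaries first: one binned histogram per column, laid out as a small gallery. The columns and values are hypothetical.

```python
# One histogram per field, before chasing correlations between fields.
import altair as alt
import pandas as pd

df = pd.DataFrame({'age': [23, 35, 31, 52, 47, 29],
                   'income': [40, 55, 48, 90, 72, 51],
                   'visits': [1, 4, 2, 8, 5, 3]})

# Build a binned histogram for every column and concatenate them into one row.
histograms = [
    alt.Chart(df).mark_bar().encode(alt.X(col, bin=True), y='count()')
    for col in df.columns
]
gallery = alt.hconcat(*histograms)
```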
So, that's a visualization for machine learning and then going the other way around machine learning for visualization is something that I've become pretty interested in. When we design Vega-Lite, we build it not just as a language that can be authored by people, but actually as a language where we can automatically generate visualizations. And I think that's, also, what distinguishes it from other languages such as D3 or ggplot2 in R, because we're in JSON it is very easy to programmatically generate visualizations, then we built a recommendation system on top of it. So, when we have a visualization language that is declarative and in a language that is easily generatable. We could think about ways to automatically generate visualizations from programs or models. One of those models is a model called Draco. My colleagues and I have been working together where we encoded design best practices as a formal model, and then we can automatically apply those best practices to recommend visualizations. And so, that can go beyond what I've talked about in Voyager where we would recommend this gallery of visualizations because you can consider a lot more aspects of, both, the data where the visualization or the tasks that the user wants to do, or the context that they're in, or the device that they're looking at. It's funny. I keep wanting to ask, actually, I don't know how to fit this into the flow, but I think one of the issues with visualizing data and machine learning, especially, with a lot of the deep learning folks that we work with is that the data often has... It's not like the sort of three independent variables and the dependent variable in a stats class. It's more like the data is like an image or the data is like an audio file. And so, I feel like just even visualizing the distributions gets unwieldy. It's also a little unclear what you would do with that. So, do you have thoughts about visualizing things where there's a higher-order structure, like an image or a video or audio file or something like that? That gets tricky because if visualization is two-dimensional, two-point something dimensional, maybe we can use color and size and every encoding channel, essentially, can represent another dimension, but after four, or five, or so, it becomes overwhelming. So, if you're having a data set with thousands of dimensions, I think the way to do it now is to use dimension and dimensionality reduction methods. So, tSNE UMAP, PCA to reduce the numbers of dimensions to the essential in some way, dimensions., Or create some kind of a domain-specific visualization. So in a way, an image is a domain-specific visualization that maps your long vectors of numbers to a matrix of color encoding. So, what do you think about... All of my Twitter feed is talking about model explainability and how that's still a very unsolved problem. So, what do you think are techniques that everyone should know, but, and how do you think the field is progressing? Do you think we can have interpretable models in five years, anytime soon? Are neural networks never going to be explainable? I don't know but that's a good question. I think many people are trying to answer. There's been a trade-off where people often made simpler models because they are more explainable and the more complex the model gets, the harder they get to explain. So, sometimes there are methods similar to a dimensionality reduction, I guess, to reduce your complex model to a simpler model, which you can then explain. 
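As an aside, here is a hedged sketch of the dimensionality reduction idea mentioned above, assuming scikit-learn is available; the 512-dimensional random matrix stands in for flattened images or embeddings.

```python
# Project high-dimensional rows down to 2D so they can be drawn as a scatter plot.
import numpy as np
import pandas as pd
import altair as alt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(200, 512)            # 200 items, 512 dimensions each (placeholder)

xy_pca = PCA(n_components=2).fit_transform(X)    # linear projection
xy_tsne = TSNE(n_components=2).fit_transform(X)  # nonlinear neighbor-preserving map

df = pd.DataFrame({'x': xy_pca[:, 0], 'y': xy_pca[:, 1]})
scatter = alt.Chart(df).mark_point().encode(x='x:Q', y='y:Q')
```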
But none of those methods are fully satisfying. Some of the techniques I've seen use more inherently explainable models that are still complex. So, for instance, a good example of that is GAMs, generalized additive models, which are linear combinations of functions applied to every dimension. Well, why is that more explainable? Why is it more explainable? Because you can apply some techniques where you can understand, for instance, the function that gets applied to every dimension individually. Or you can also then look at how those dimensions, with the functions applied to them, get combined in a linear function, which is a lot easier to understand than some nonlinear combination of many dimensions. But when you want to have the different dimensions interact with each other, or allow for that... I guess, maybe taking a step back, can you, kind of, make this a little more concrete for someone who hasn't seen this before? What kind of functions would you be imagining and how would they be applied? For instance, if you want to predict a quantitative variable, some number, let's use the standard example, the housing price, the price of a house. You want to do that based on the dimensions, the available dimensions. Let's say the size in square feet, the number of bathrooms, the number of bedrooms, or the number of floors. And so, now what you can do is a linear combination of the dimensions to get the price. So, if you just take a linear combination, you could say, multiply the square feet by, I don't know, 10, the number of floors by 20, the plot size by 5, and then get a number out, and that is the housing price. So, that would be a simple linear model where you, essentially, apply a weight to every individual dimension. Now, what a generalized additive model does is apply a non-linear function to each dimension individually. So, it can be like a log function or anything else, it can be as complicated as we want, but because it's a function, you can actually visualize it very easily just by looking at the value on the x-axis and the value after applying the function on the y-axis. And so, if you then want to know the price of a particular house, or the predicted price of a house, in each of these charts per dimension you just look up, for my value, what's the corresponding value that goes into the sum, and then you just sum them up. I see. So, you could see exactly how much each thing contributed to your final score or your final prediction. Mm-hmm (affirmative). Yeah. And a very good example, if you want to actually play with that and try it out, is this system called Gamut, which is a research project at Microsoft Research, where they built a system for doing exactly this task of understanding a model that is one of those GAM models: being able to, for instance, compare the predictions for two houses, understanding how much each dimension contributes to the predicted price, and also making it very easy, when you look at the general model, to see the whole model in just one view. And yes, you don't have the ability to have multiple dimensions jointly affect your output, but still, these models work fairly well and are a lot more interpretable than a model that incorporates many dimensions in every single point. Do you have thoughts on visualizations to help with understanding what's going on in much more complicated models? 
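Here is a small numeric sketch of the additive structure described above; the per-dimension functions and coefficients are invented, and the point is only that the prediction is a sum of per-dimension contributions that can be read off (and plotted) one at a time.

```python
# A toy generalized additive model: price = f1(sqft) + f2(floors) + f3(bedrooms).
import numpy as np

def f_sqft(sqft):
    return 50 * np.log(sqft)        # made-up nonlinear effect of size

def f_floors(floors):
    return 20 * floors              # made-up linear effect of floors

def f_bedrooms(bedrooms):
    return 15 * np.sqrt(bedrooms)   # made-up nonlinear effect of bedrooms

def predict_price(sqft, floors, bedrooms):
    contributions = {
        'sqft': f_sqft(sqft),
        'floors': f_floors(floors),
        'bedrooms': f_bedrooms(bedrooms),
    }
    # The prediction is just the sum, so each contribution is directly readable.
    return sum(contributions.values()), contributions

price, parts = predict_price(sqft=1500, floors=2, bedrooms=3)
print(price, parts)
```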
Like, say, a convolutional network or a fancier type of network? Yeah. I think visualizations can actually help at different points. And I think visualizations are only as powerful, or only as useful, as the task that you designed them for. So, I think, in general, saying, ""Oh, can you visualize this thing?"" is impossible without a task. So, can you visualize X for Y? For instance, one could visualize a model for the purpose of understanding the architecture. And so, when you, for instance, have a complex network, with many layers and many different complex functions at each of your layers, you might want to visualize it to see what functions are being applied, what parameters are being used, and how big each layer is. And so, there are a couple of visualizations. I think one of the most popular ones is probably the one in TensorBoard, which actually my colleague Ham started when he was interning at Google. Did you mean the parallel coordinates plot maybe, or which visualization in TensorBoard? In TensorBoard, it's the visualization of the graph, the data flow graph. There are, kind of, two views in TensorBoard. There's the one where you look at your model outputs or your metrics, and there's the one where you look at the model architecture, and I'm talking about the model architecture one. So, that can help you to, for instance, debug what's happening, but it doesn't help you at all to explain a particular prediction, for instance. So, for that, you might use a different visualization that has feature visualization, so it lets you inspect different layers and what the attribution is in different layers. Cool. We always end with two questions. I want to make sure we have time for them. And I think we, maybe, should modify them slightly to focus on visualization. So, normally we ask, ""What's a subfield of machine learning that people should pay more attention to?"" Which I'm curious about your thoughts on, but maybe I'd also ask about a sort of subfield of, kind of, visualization that you think doesn't get as much attention as it deserves. I think for machine learning, I'm very excited that there's a lot more attention to understanding what's happening in these models. I'm also a huge fan of more classical AI methods, which I guess is not machine learning anymore. But yeah, I'm very excited about constraint solvers and using those classical methods. Whoa, maybe we have not had that answer, constraint solvers? I thought you were going to say SVMs or something, but constraint solvers? No. Classical, like AI, not even learned... I thought they used ML to do constraint satisfaction these days. I don't know. They use ML now for learning indexes in databases. I think these classical methods are exciting because they allow you to describe a kind of a model, a concept, a theory in a very formal way, and then automatically apply it, very declaratively, for declarative problem solving, describing the problems and solving them. And these solvers are amazingly fast today. Then, what I'm pretty excited about in visualization: because it's a science, we're trying to explain what makes a visualization good. And there's been a lot of work on high-level design of good visualizations. So, I talked about these principles of effectiveness and expressiveness earlier. And there are now systems to automatically apply them, and there are design best practices, and there are books, and people are teaching those in classes and so on. 
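For readers who want to see the graph view mentioned above, here is a minimal hedged sketch assuming TensorFlow/Keras: logging a toy model so its data flow graph appears in TensorBoard's Graphs tab. The model, data, and log directory are placeholders.

```python
# Log a tiny Keras model so its computation graph can be browsed in TensorBoard.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(16,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mse')

x = np.random.rand(64, 16).astype('float32')
y = np.random.rand(64, 1).astype('float32')

# write_graph=True records the graph; run: tensorboard --logdir logs
tb = tf.keras.callbacks.TensorBoard(log_dir='logs', write_graph=True)
model.fit(x, y, epochs=1, callbacks=[tb], verbose=0)
```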
And then, on a very low-level, perceptual level, there's some understanding of how we perceive colors and shapes and the gestalt of shapes, and how we see patterns. But we don't have a good understanding of how those low-level insights on perception actually translate to those higher-level design practices. And I think the two sides are slowly inching towards each other, but they're still far from each other right now and, kind of, slowly inching towards each other. And what I'm excited about is... It's, kind of, like the general relativity theory of how do these two actually combine? We need a unified theory there of how the two things relate. It's like, we know the high-level, it's, kind of, like relativity. And we know the small quarks things. We don't know how they relate to each other. We know how the universe behaves. We know how little particles behave, but when you combine them, that doesn't work. And so it's, kind of, like this crisis that physics has had for a while, as well, in visualization. Well, what a great answer. That's so evocative, I want to talk about that for another hour. Normally, we end with asking people, really on behalf of our audience, kind of, what the biggest challenges are that you see in taking ML projects from, sort of, conception to deployed. Do you have thoughts there? I think one of the trickiest things in deploying machine learning is metrics. Coming up with good, meaningful metrics that you're optimizing... So much of machine learning is optimizing a function, but what is that function? And how do I make sure that that's actually a meaningful function and, also, that it's going to be meaningful in the future? Because we know from many examples that if you over-optimize a metric, that metric becomes meaningless. So, how do you ensure that a metric is meaningful right now and will be meaningful in the future? And that it's actually tracking what you care about. It's a difficult question. And I don't know whether there's going to be one answer. I don't think so. Train a model on a bunch of different optimization functions and figure out which one it is, or something. But I, kind of, want a specific guess about the biggest challenges around machine learning interpretation, and also, when you're training models, using visualizations to debug these models. Do you have any thoughts around that, maybe? As I said earlier, I think data is essential for machine learning, and so understanding data is crucial. And I don't know how much the methods and tools we have for general data analysis might have to be adjusted for machine learning. For instance, Tableau or Voyager, all these tools that are designed for exploratory analysis of tabular data, where do they fall short when it comes to machine learning? Because, as you were pointing out earlier, machine learning often has this high-dimensional data, images and sound and so on. Can we design other representations? I don't even want to say visualizations, but just representations that help us see patterns in that data, meaningful patterns, meaningful for the task of training the model or understanding the model. That, I think, is going to be an interesting question for visualization tool designers who'd like to work in the machine learning space going forward. You know, it's funny. 
I feel like one way that everybody working in machine learning, including me, misallocates their time a little bit is that you almost always spend too much time looking at aggregate statistics versus individual examples. Every time you look at an individual example, you're just like, ""Ah, like I can't believe I missed this stupid thing that... It was breaking my model or making it worse in some way."" And so, I wonder if the gap is... We have really good tools, I feel, for aggregate statistics, but it's hard to quickly drill into stuff, especially when your data sets could get very large. I believe, actually, that we have... I totally agree with that. We have very good tools for looking at aggregate statistics. I think we also have reasonable tools for looking at individual examples. Go look at an image, that's okay, or a row in a table. But I think where it gets really tricky is understanding the in-between. So, understanding the subgroups that exist in the data, and that is hard because there exists an exponential number of possible subgroups in a dataset. And if you have a million rows, that's a lot of subgroups, and only very few of them are actually meaningful. So, understanding which subgroups are behaving oddly, or are negatively affecting your model, and looking at those, that is a challenge that I see over and over again. I think this problem of not aggregate and not individual, but somewhere in between, and where in between do I want to look, that to me is where the difficulty lies. All right. I think that's a nice note to end on. Thank you so much. That was really fun. Okay. Yes. Thanks for all the questions and everything. Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun and it's, especially, fun for me when I can actually hear from the people that are listening to these episodes. So, if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. 
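Here is a rough sketch of the in-between analysis described above, using pandas: compute the metric per subgroup and sort so the oddly behaving groups surface first. Column names and values are hypothetical.

```python
# Per-subgroup accuracy: not one aggregate number, not individual rows, but the groups in between.
import pandas as pd

df = pd.DataFrame({
    'country': ['US', 'US', 'DE', 'DE', 'IN', 'IN'],
    'device':  ['ios', 'android', 'ios', 'android', 'ios', 'android'],
    'correct': [1, 1, 1, 0, 0, 0],
})

by_group = (
    df.groupby(['country', 'device'])['correct']
      .agg(accuracy='mean', n='count')
      .reset_index()
      .sort_values('accuracy')
)
print(by_group)  # worst-performing subgroups first
```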
And, also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",6462 +Cade Metz — The Stories Behind the Rise of AI,https://www.youtube.com/watch?v=ta2hj9b9R-E,2949,2021-03-18,ai is a weird field right it's it's this combination of various fields and it's always been like this right since the 50s when the term was coined it's this blend of computer science and neuroscience and psychology that has always been the case and it continues to be the case you're listening to gradient descent a show about machine learning in the real world and i'm your host lucas beewald kade metz is a journalist who's been covering technology for the past few decades and he recently wrote a book genius makers which is kind of a historical document up until the present about artificial intelligence and the people that built the technology behind it i have so many questions about this book i can't wait to talk to him so you were the first non-ml practitioner to to ever appear on this podcast so i'm excited to do this and we might take things into different direction than um normal but i was really excited to talk to you i kind of procrastinated on on reading your book and then i actually really enjoyed it i was kind of afraid that i wouldn't you know the name made me a little worried that it might be a bit over the top or something like that and i also felt like typically when you know when you read journalism on topics you know really well it's hard not to be critical or feel like you know the the person didn't you know get something exactly right but then i actually it's it's you know it kind of reminded me of that show silicon valley just in its like incredibly accurate details like i feel like i've been in this world of of machine learning and i've been in a world of venture capital which are kind of the two main topics that your book covers and just all the little anecdotes and details they really just rank true to me like i felt like you know you do these things where you explain math you explain sort of like when someone makes fun of somebody for you know differentiation by faith what that means or like you know you describe what a tpu does and you really actually go into technical detail that i'm not even sure i would necessarily do you know if i was writing for a mass um audience and i actually think you are remarkably accurate in that and then you sort of describe these like very vivid scenes that just seem like you know sometimes i feel like when you read sort of descriptions after the fact of the scene of like an acquisition or a fundraiser or something it's like i don't think this journalist really you know it was getting accurate information or transcribed it the way it just didn't doesn't feel right sometimes but your book really felt accurate to me and it was like a really interesting lens for me just on a world that i've been sort of adjacent to you know some of the folks have been on our podcast some of them are customers of ours now so i know i know a lot of the characters in your book but i don't kind of get to know them intimately in the way that that you clearly got to know them so i actually you know the question i was kind of dying to ask you which has really maybe nothing to do with ml is just how did you get so much access and what was your process for researching this book because there are some details i'm just surprised you got someone to tell you and it's not like you're sort of recanting interviews that you did it's like somehow you just it seems like you must have 
actually sat down with jeff hinton for a significant amount of time to be able to write this or or maybe of some other process that i don't understand no i mean well well i will tell you but like you know it's really interesting for me first of all to hear what you thought you might get and then also what you might have gotten in in the end after reading it like in in a way there was a there was a time when i started you know work on this book and really got into it when i realized it was a really dumb idea because on one once that your audience hopefully is going to be machine learning professionals and researchers who are really steeped in this stuff and if you if you venture too far outside that world you're going to to get them angry with you and you're going to lose them but ultimately the the goal of the book should be to have any reader pick this up and enjoy it and that should be the goal as well and if you move too far towards the machine learning researcher you're going to get those people angry and they're not going to take up your book and the trick becomes to to find the sweet spot right in the middle and and that's that's very difficult and then on top of that within the machine learning community we act like that's a monolithic thing it's actually this huge spectrum as well and this is what you know i get to at the end of the book really is that you know you have some people who believe this is just math and you have other people who where this is something more right this is this is sort of this almost it's almost like a religion and this is going to create agi this is going to create a machine that can do anything the human brain can can do so within the ml community you have this spectrum where people really really disagree and and and the goal is to in somehow you know get all those people interested in your book it seems like a mistake um to even try that but here here is what i really believe in and this sort of gets back to your question ultimately this is a book about people right this is a book about some really interesting people and they are interesting in incredibly different ways from jeff hinton to demisabus at deepmind to jeff dean i can go on down the list tim neat gebrew who is in the news recently because she really clashed with people at google including jeff dean you know these are these are really interesting people who are relating a lot of different ways and ultimately that's what this book is right it's a book about the people and what i realized is that i if i can just show who these people are and who they and what their stories are and how all those stories fit together then that that's what makes it successful right and what what that's about what what finding what those people are about ultimately it's about spending the time with them you know as you indicated right and that takes a lot of doing right you just some people because they work for these giant companies you can't really get at them initially so you you try somebody else and you get some good stories from them and you go back to the first person you say hey i've got this what else can you tell me and you develop in some ways a relationship with them you know i tend to like even as you get close to these people you know keep a little bit of a distance as a journalist i think that's important too because again you've got to have an objective view and be able to to really appreciate and rope in you know the beliefs and the experiences and the points of view of all these different people 
but it's about years and years and years of gathering information um and understanding it yourself and and taking it back to people and say can we talk about this more and then somewhere along the way you you know you get them to talk well you know i can only speak for myself but i thought it was a really interesting book i mean just my i you know i couldn't put it down once i started it one thing i was wondering about is the ending i thought was very understated like you know it's it sort of ends with with jeff hinton who's kind of the main i mean i feel like he's almost like the main character in your book kind of saying like well maybe agi isn't that important how about spoiler alert i guess well but you know but you know i and i thought i thought it doesn't even like like he's kind of like well what like would you really want a vacuum cleaner that's this was my takeaway i'm curious like if i if i got it wrong or different than your intention tell me but you know i was thinking does he really like a vacuum cleaner that could like you know navigate my house and and be smart about like when to turn on and off and stuff i mean it doesn't really exist and i actually i think i would want my vacuum cleaner i think to be you know reasonably smart and and you know bordering on you know if it could you know reason about the world that that seems like it would actually be kind of better than than a roomba that that you know can't yet i was kind of surprised that jeff hinton thinks that and it sort of felt different than what most of the people in your book were thinking i was just kind of curious and and then you actually say you sort of say but you know he kind of invested in you know some crazy reinforcement learning company so maybe he doesn't even really think that you know what i'm kind of just kind of left with i wonder like what's the what's the takeaway here and i was wondering what's what's kind of your takeaway because you really noticeably never seem to take a position on this stuff but i mean you've been watching the field for for decades i'm sure you have um opinions on this right well you know it's interesting that you would in in some ways have that takeaway that the ending is understated and sort of questioning agi almost you know my first book review has come out there's this trade publication called kirkus which which reviews big books and their takeaway is completely the opposite like if you read the review it is kade metz is making the case that agi is possible right and so it is it is that they completely other ending spectrum in you and and i completely understand why the two of you have come to different conclusions because and it makes me happy because that's my aim my aim is not to judge my aim is not to make a call my aim is to show what is going on in the world and what has gone on over the past 10 years and you have someone like jeff hinton who is one of the most respected people in the field at the same time there are people who who can't stand him because they feel like he has gone too far there are people who can't stand him like because he doesn't go far enough and doesn't say that agi is around the corner and you know you're right the book ends with him you know in a way questioning agi but it also ends with him changing right him embracing some stuff namely reinforcement learning that he hadn't embraced in the past and he sees the value there and he sees it accelerating and in a way he's just not going as far as some other people and you know i think the book 
ends with some people who who take a a very different view and who do think a agi is around the corner and it states them very explicitly right and you know i i think it's i think it's about showing where all these people are coming from and letting the reader make their own decision about what's really going to happen it's kind of interesting i mean one of the things that struck me about your book is you're sort of describing in a historical like a story and feeling way something that's like completely in progress like there's a whole bunch of things that happen you know right after your book stops right like you know timnit and jeff dean and all the stuff that happened at google and then i was actually thinking it's funny like joshua i think in your book he's not really doing a lot of commercial stuff in contrast to some of the other characters but then you know i think element ai you know recently kind of sold and that was like a bit of a you know controversy if that was a good outcome or a bad outcome and then you know open ai went you became a private company i did you have a sense of like the book needs to stop here or was there other stuff that you kind of wanted to include could there be a sequel to this there could but you know what's interesting about all the stuff you mentioned and i would argue like almost everything in the book it is completely in tune with what happens in the book right now timmy gabriel is a character in the book and the same things happen in the book it's just with a different company right it's with amazon right okay it's not with google then it happens with google yahshua bingio you know he has stayed outside or more outside of the commercial realm than hinton and lacoon but as the book goes into like he dips his toes certainly right in the book you know it's you know it's his partnership with microsoft he's also had one with with ibm and with all these people it's sort of a balancing act and i think that's what the book is about is that you have these very idealistic people whether it's to me or it's joshua or hinton and they all come into contact with these forces that are frankly much bigger than them these corporate forces these government forces and you know when that happens there's going to be conflict and all those conflicts that have come since i finished the book it's all happening in the book as well and you know what you know open ai you know how many times have they gone back and forth you know as far as what they're going to do and what they're not going to do are they a not-for-profit are they a public company do they believe in withholding the technology or sharing it right these things will continue to go back and forth but the the constant is is that clash right of of belief and then you know those corporate forces which are about money and about attention and promotion you know i think that those are the constants and and that's that's why i really believe in the book is because all those things are just going to continue to play out in the years to come i mean one thing that i really was curious to ask you about is you know you kind of set up these kind of dichotomies that you personify right and like sort of gary marcus versus on the maybe or like elon musk versus zuckerberg should probably see what those are people listening to probably guess what the you know what the dichotomies are here sure i i like i was curious where do you land on the stuff now that you've kind of talked to everybody like do you feel like for example like do 
you feel like we are you know sort of like overstating the future progress of ai it sort of seems like if you take a historical view like you're taking it seems to me like ml has just kind of made this sort of steady incremental progress and people keep moving the goal posts of like what you know what it means to do agi like first you have to win at chess and then you have to let it go and you know you know then it's like you have to pass the turing test but then that doesn't even you know that's not even hard enough and so like when i take for me when i take a historical view i sort of imagine steady progress extending out into the future then when for me when i look at these algorithms it sure seems like a stretch that they turn into you know kind of agi just with with more compute so i actually don't even know where i land but i'm curious where where you land on this topic well i i think you're right you you have to look at this historically and that's what the book does is that a lot of the claims that are being made now about agi and this sort of pervading our lives and sort of taking away jobs all that has been around since the 50s right and i showed that in the book and in a way it's just a repeat of that now that said there has been a huge amount of progress over the past 10 years which is what the book really covers right we've had a huge amount of progress what what i really believe firmly in as a journalist particularly as a as a new york times reporter is i feel like what has happened and what is possible in the public consciousness is way out of whack right and a lot of that just has to do with the term artificial intelligence you know which has been thrown around so much over the past 10 years that alone gives people the false a false impression right about what is happening and what will happen and then you know frankly most people write about this stuff they you know for whatever reason they don't really understand what's going on and and they exaggerate and maybe they exaggerate consciously maybe they exaggerate unconsciously maybe they don't know that they're exaggerating but if you sit down and you read most of the stuff is written you have a false impression and what that that is one thing that i really want to at least in my small corner of the universe try to correct and show people what is really happening right and the fact the matter is none of us knows what the future is and you know as much as you know someone who really believes in agi might get on this this you know call with us and you know get angry at me for not saying agi is around the corner the reality and i think the book shows this is that none of us know what the future holds and when it comes to agi it's an argument it's a it's a religious argument right and i show that in the book people with the same experience the same knowledge the same respect across the industry really disagree on this but go ahead no no it's funny i guess like one thing that's sort of it's almost in the water so i don't think to question it because i kind of swim in it right is why do you think it becomes such a religious argument like why do you think people feel so passionately frustrated that you know other people don't agree with them on this on this particular topic of like is aji possible or coming or coming soon well i think that people are just coming from from a very different place when they start talking about these things and one of the things you realize about silicon valley is if you're going to be successful 
you've got to really believe in what you're doing right that's again either consciously or unconsciously that's how you attract the money that's how you attract the talent that's how you get these things to snowball okay whether you're building you know a tiny little app that does something simple or you're trying to build agi so what has happened is people have taken a rule book that has worked in silicon valley for certain things let's say facebook right and they're applying it to this notion that they can build a machine that can do anything the human brain can do so in their mind they're just doing what everybody else has done right but agi is different than facebook right that is a goal that is far far bigger and so you know in their world they're just doing what everyone around them is doing has done for the past you know however many decades in silicon valley but for someone else like they're taking a huge step and they they just do not see that right how can you extrapolate from from you know a machine that can play go to a machine that can do any anything human brain can do and if you ask people to to describe to you how that's going to happen right that's at very least that's hard for them to do right describe how that's going to happen right you know the path that they see is a path that they painted very broad strokes and you know saying i can build a facebook you know there's a path there to building a social networking app we know how to do that we don't know how to do this and we don't know how to build a self-driving car right that alone is an astronomically difficult project that we don't quite know how to complete yet but people still talk about it in terms like it's already there and on one level you see why they do that but on another level right it misleads the public it misleads people about what's going to happen soon so i guess i sense that one opinion that you kind of hold is that there are a lot of over-inflated claims and therefore the public feels like the public does not have a good sense of what's possible and not possible that that'll that at the very least is true right you know who knows you know tomorrow we may have a new technology that really blows us out of the water but what we've seen over the past um 10 years with this are repeated over inflated claims just in the sense of they don't give think about your mother right or my mother when when they read stories even in the new york times over the past a few years where they assume that tomorrow we're going to have cars that can drive by themselves all over the place right they can't help but have that assumption because that's the way it's written about and journalists write about that way because people like elon musk and so many others just say it's around the corner and they take them at face value right so i think that's really where the problem is that you're you're misleading the general public and and i do think that that's a real real problem right in a at a time when our society is grappling with what is true and what is not let's let's make more of an effort to to say what is actually possible now and show people what the reality is now and and try to do that in a way that's separate from what might come right the reality is now is that self-driving cars aren't up to the task but it's kind of interesting you say that because well i wonder if maybe journalists are at fault then because like certainly elon musk has a pattern of over stated claims but i i think he might be a little bit of an 
outlier i mean you would know better than me but i feel like when i talk to ml researchers they tend to be fairly understated or almost like maybe a little too reticent in their claims and maybe the ones that rise to the top aren't like that but you know we've done like 30 40 interviews on this here and i almost feel like i'm trying to push people to you know like extrapolate what you're doing like it seems like a big deal i don't know like when you talk to like jeff hinton or actually let's go way back right in your book you talk about like rosenblatt and then the you know the new york times i think or yeah it seems like a lot of journals the journalists kind of write about what he's doing saying it's going to get consciousness soon when he's basically like you know doing like you know like a perceptron without even like a second layer exactly so what happened there like do you think do you think rosenblatt has a responsibility to communicate better what's going on like was he making over-inflated claims at that time well yeah he clearly was right i mean you know he's telling these reporters that you know we're going to have systems that can walk and talk and and recreate themselves and you know somehow venture into space right like and so the reporters are just going to report that right right okay and in a lot of ways it's not that different now and you talk about elon musk being an outlier that's true and it's not like again you talk about ml researchers that is not a monolithic group like that's the other thing i want to show people is that even the new york times has written stories ai experts say x right well ai experts that's not you know one group it's you know it's this like spectrum of people and if you you got to remember like deep mind and open ai are founded on the notion that they are going to build agi and there are people at those companies who really really believe that and they're at the top of those companies and they may not be as cavalier as elon musk they may not have the megaphone that he has but they really believe that and those are important companies right they have a lot of serious research talent particularly deepmind has had some really important breakthroughs you know just recently the cast contest breakthrough that that's really important research that in in some ways is separate from this you know notion they're going after agi so these are important important labs that are founded on this this belief right and you know i i've known demus hasabis you know the co-founder of of deepmind for a long time now and whatever you think about that belief of agi you got to take that guy seriously right he you know he has a track record he is he is a a serious serious person and you may have a problem with a lot of the stuff he has done or said but you have to listen to him right and i mean similarly the work coming out of open ai it'd be hard to argue it's not super impressive like you know so i feel like some people claim that it's a little so there's a little bit of publicity stunt but you know you know like you talk about the robotic hand manipulating a rubik's cube and that's really impressive and maybe the rubik's cube makes it more fun but you know i i still think it's an amazing breakdown i agree i completely agree it's both right it is super impressive science on the one hand and it's a stunt it's both and and me as a as a new york times reporter as a book author my job is to show you that it is both right and give you a really real sense of what's going on 
there. It's very easy to see that hand, this five-fingered robotic hand, solve a Rubik's Cube and think AGI is going to happen tomorrow. If you're not educated in the field, it is super easy to think that, and so my job is to say there is an advance here, and you can see it, but there are some chinks in the armor. And the other thing that I've seen is that not even everyone at OpenAI is aware of the chinks in the armor. That hand, while the result is super impressive, there are some caveats there that show you even the science isn't quite where you think it might be, let alone the stunty nature of it. My point over and over again is that these things are complicated. I guess, and maybe this is inserting myself into it, it's your story, but I was kind of there throughout it, and I keep having this thought. I was at the Stanford AI Lab in like 2003, 2004, at the sort of nadir of interest in neural nets, and you talk about this in your book. I felt like the zeitgeist there was kind of like, ah, these neural nets, the name is too good. We use support vector machines, not neural nets; that's not serious. It's like these people just keep trying to hype these things, and yeah, they sort of work, but they're kind of tweaked to the point where they're overfitting, and serious people wouldn't make a system called a neural net. And it's been kind of interesting to watch it turn out that the neural net strategy actually really works, like the perceptron is the base thing that now is used everywhere. So I actually kind of feel like maybe the folks I was working with at that time weren't dreaming enough. I think it's great that Andrew Ng, when he saw it working, really invested in it. But I remember, you talk about some stories about the skepticism of the progress of neural nets, and I vividly remember that, just like, everyone says they have a better algorithm, especially neural nets. But then they were right. And I kind of wonder if you feel like there are any lessons in that, because it seems so remarkable that something would get all this attention, and then sort of be thought of as bad, and then come back as the working technology. I wonder if there are other technologies out there that have followed that same path. Well, I think it's an incredible story. It's amazing that some people kept working on this stuff, and that, again, is at the heart of this book, and it's something that I have always been amazed by and impressed by: someone who keeps working on something even in the face of everyone telling them it's not going to work. That is the basis for any good story, and that certainly happened here, and it will keep happening. And in fact, in some ways you've already come full circle, where you have, let's call them the Gary Marcus crowd, who are saying the same things, like neural nets don't do everything these guys say they're going to do, they're limited. So in a way they're still fighting the same battle. But you're right, there are other technologies that will come along
, have already come along, that people are skeptical of, that are going to work in the face of that. And it takes that. It takes that belief and that determination and just years and years of hard work to make this stuff do what it's ultimately going to do. It seems like, with a lot of the characters in your book, I was kind of struck, and I don't have a good stat on this so I could be wrong, but it seemed like a lot of them didn't come from a computer science background. A remarkable number kind of came from biology and neuroscience and things like that. Do you have any thoughts on that? I agree, and that's another thing that I'm fascinated by. AI is a weird field. It's this combination of various fields, and it's always been like this, since the '50s when the term was coined. It's this blend of computer science and neuroscience and psychology. That has always been the case, and it continues to be the case, and this is embodied by, again, my main character, Geoff Hinton, who is someone who didn't come at this from the computer science angle. One of the running things in the book is that he loves to downplay his skills as both a computer scientist and a mathematician; he doesn't think of himself as either. He comes at it from that other direction, and sort of gives what is really just math a perspective that you wouldn't necessarily expect it to have. And that bothers some people, and some people don't understand that perspective that he gives it, but that is how he thinks, and it has a real influence not only on how this field has progressed, but on how people perceive it. People don't understand, when he and others, as much as they explain it and re-explain it, call a neural network a facsimile of the human brain, they don't understand that's just a metaphor, in some ways. But that's part of the way this field works. Well, I guess from a historical lens, maybe the takeaway is that being an outsider is an advantage in some ways. Absolutely, absolutely. And that's sort of the story of Silicon Valley as well. But that doesn't mean that just because you're an outsider, you're going to be right. Not every outsider is right. Some are and some aren't, and I think that's the story of this book as well. Probably everyone else is going to ask you this question, but I felt like I had to ask it. Do you have any kind of fun stories that you couldn't fit into the book because they didn't quite fit, or any good anecdotes from all the research you did? That's a good question, let me think that over. Most of it's in there, to tell you the truth, all the good stuff. Some of it is just unbelievable, and it took a long time to get, and once you have it from one person, you've got to get it from another. So there were a lot of things, including the lead story in the book, in the prologue, that I wasn't sure I was going to be able to get in there, and thank goodness I did. Talking about the auction of the company, DNNresearch? Exactly, and in particular the price. That was one of the hardest facts to nail down. You know, I have to tell you, that's the only anecdote in the book I don't totally believe. It was the one where it just... maybe it's because
it's actually true, it just feels unbelievable. It is 100% true. Including, wait till the part... I felt like it might have felt that way to the people involved, but it's hard to believe it actually happened like this. They literally got Google and Baidu to bid at a particular time, like they're running a Sotheby's auction or something. Are you sure that's true? That's amazing. No, but it's true, because I can't tell you the number of people I talked to who were involved in that, directly involved in that. It's absolutely true. And I guess that's how it goes: the thing that's really true is actually unbelievable. Exactly. And so many parts of that story are amazingly, improbably true, because it encapsulates everything. At the very beginning of this movement, let's call it a movement, the very beginning of this explosion in AI hype, in neural networks starting to work, all the players who would be involved are already there, from China and Baidu, to Google, to Microsoft; DeepMind is there. They're all there, in this competition that would play out over the next 10 years. And that whole story came to me in bits and pieces over the course of what was really months or maybe even years, and as each piece pops into place, you're saying this sounds too perfect to be true, but you know it's true because it's coming from multiple people, and it's verified by multiple people, and all the perspectives kind of come together. Some people say, ""Well, I won't tell you that,"" and then you get it from somebody else, and they say, ""Okay, yes, it's true."" That's what's most fun about being a journalist, when you get those nuggets that just show you so much about human nature and also help your story fit together in ways you never expected. I never expected the book to begin with that, but it had to begin with that, because it's just the greatest story. It's a good story, and you go back to it a lot. Yeah, it is a great story. I guess one more thought that I had reading your book: I hadn't quite had the timeline in my head of when neural nets started taking off, but one thing that's kind of impressive is, I feel like Elon Musk and Zuckerberg and Larry Page noticed that neural nets were working really well before most academics even noticed it. I was thinking about the timeline, and I've been in ML, selling to ML companies, for the last 15 years, and I feel like actually they were really early. How did they figure this out? It's remarkable, isn't it? And I think one of the things you can do is contrast the way they reacted. You can criticize the way they reacted, you could say they went too far, of course, but contrast the way Google and Facebook reacted with the way Microsoft reacted. Microsoft did not jump on it the way that those two other companies did. They didn't see it the way that the leaders of those companies did. Part of the narrative there, in my book, is that Geoff Hinton was in Microsoft's lab doing this stuff with speech, and it worked in a way that nobody thought it would work: nobody in the ML community, nobody at Microsoft. And it works, and they're all shocked, they're all blown away, but they don't
jump on it the way that Google and Facebook did. That's really, really interesting, and you do wonder, is it about the age of the company? Is it about the general area that the company plays in? Google had a real need for that speech recognition system that Hinton and his students built, in a way that Microsoft didn't, because it had Android; it had a place to put it. Now, it was also a company, and this is talking in broad strokes, that would take new technologies and put them into play far faster than Microsoft would, especially in those days. That's part of it, but in the end it's a combination of these things. It's the way the leaders think, it's the way the company is built, which in some ways is a reflection of the leader, and it's about the age of the companies. Once these companies get to be a certain size, like Microsoft, it becomes harder to jump on something. But as you see in the book, the way that Google jumped on it is astonishing. There's that conversation between Larry Page and Alan Eustace, where he says, you've got to bet big on this. And this is, you're right, before even the ML community at large really understood what was going on, and Larry Page is telling Alan Eustace to basically bet the farm on it. It's astonishing, it really is. I guess my takeaway is, when I see something working, I'm going to jump on it. But even then, it's unclear where it's going to go. It works for speech, and then it works for images, and ImageNet is such a big moment, but then people in the ML community are still like, is this really going to work with natural language? I mean, years later they're saying that: is this really going to work with natural language? And then it does. These large language models, Google's BERT, GPT-3, it really started to work, and there was real doubt there. It's hard to see these things even when you're close to them, and we could go on down the line: robotics. It's not clear, even when this stuff works in multiple different areas, whether it's going to work in the next one. One theme that also comes up in your book, of course, because we're talking about academics, is who gets credit and who doesn't get credit, and where credit is deserved. And actually, one anecdote I never knew, that you have in your book, despite Wojciech being a pretty good friend of mine, is that AlexNet was originally named after him. Do I have that right? I can't believe he never told me that. I feel like if I was him, I would. It's a great story, right? Why do we call it AlexNet? You go to the paper, and the paper doesn't really call it AlexNet, but now everybody calls it that. Well, the way it worked was, and this is in the book in a much more elegant way, Google had started to build its own version, basically, and it was Wojciech who did it, and the way it worked at Google was, whoever built the thing, you named it after them, and so that's what they called it. And then Hinton and Krizhevsky and Sutskever show up, and they're like, why do you call it that? It's Krizhevsky who built the thing. So they just start calling it that, and that's what propagates all over the community. I think it's a testament to those guys that they're, rightfully so in a lot of ways, revered, and in a way they had some capital. But it's
also just funny how those things work in the tech community, and sometimes those things are corrected, so to speak, and sometimes they're not. Well, who do you think, is there someone that stands out to you as not getting the credit they deserve? Because most of the heroes of your book, I think, are really well known, at least to people listening to this. But did people talk about someone, when you interviewed them, who doesn't show up in such a big way? Well, I think Jürgen Schmidhuber is the classic example. He's been written about a lot; he's written about in my book. You know, the reality... I don't know that he comes across so well in your book. Interesting, okay. I don't know. What I was going to say is, with all of this stuff, it's complicated. And, well, before we get to Jürgen, let's start with AlexNet. The reality is that although Alex Krizhevsky and Hinton and Ilya Sutskever did the work on that and really made it happen, they are building on the work of Yann LeCun. They're using a modified version of his algorithm, and he's building on the work of so many others. Everybody's building on everybody else's work, and on some level they all deserve credit. And what Schmidhuber is saying is, these guys who work for these very big companies are getting this credit, and I'm not. And I really like Jürgen, and I feel for him. At the same time, he is out there saying, give me credit, give me credit, and that's part of this too. Some people do that; some people let the credit come to them. And that's going to be viewed in different ways. Some people are going to criticize Jürgen for saying give me credit, give me credit, but I know him, and you can't help but feel for him as well, because the reason that these others have gotten so much credit, in large part, is because they had these giant companies behind them. And these companies are good at producing and driving narratives, and some of the narratives that have been out there aren't necessarily true. There has been published stuff, a lot of it coming from the companies, that doesn't necessarily give the real view of these things, and the real view is that it's more complicated than you think. Do you think there's a topic in AI that the press should cover more than they do? I think it's more about, and I guess I'm going back to what I've said before, the press needs to cover this in a different way, with more skepticism, I guess. With more skepticism. And look, it is hard. Again, you've got to strike the right balance between showing people what's really going on, but not going too deep into the weeds; you don't want to lose people. And that's a very hard thing to do. But when it comes to topics, what I will say is that a lot of people have written about this clash at Google between Timnit and the company. She's saying that she was fired, and some people at Google are saying that wasn't the case. In a way it's a very specific argument, but I think this is really representative of a much larger clash that is going to have to
happen in this field. These language models that are being built, these giant, GPT-3-style language models, they are inherently biased. That is just a fact, because human language is biased, and these things train on this enormous amount of text. They're biased, and they spew hate speech and other toxic material. That's just the reality, and that's what Timnit and others were saying in the paper that was at issue at Google. If those models are going to continue to progress and really get out into the world, that battle is going to happen. It's going to have to happen on a much larger scale, at so many different companies. And what's the battle? What are the two visions of the future? Well, on the one hand, you have a company like Microsoft, who put out a much simpler conversational bot, years ago now, called Tay. Yeah, of course, I remember that. It was a rules-based, for the most part, chatbot, and it started spewing hate speech, and it created this huge backlash, and they took it away. Okay. Microsoft, ostensibly, is going to put GPT-3 out in tandem with OpenAI. That is a clash waiting to happen. Microsoft's got to deal with the fact that these things are biased, and that's going to offend a lot of people. How do you deal with that? That's an open question. It's an open question for Microsoft, for Google, for Facebook, for OpenAI. On the one hand, you have science really progressing and doing amazing things, but you have this problem. It's a problem for a lot of people. And some people don't see it as a problem; they just think, we need to release this stuff, and get over your issues with the bias and the hate speech. But a lot of people think it's a real problem, to the extent where that clash is going to have to happen if those models are going to continue to progress and get out into the world. You've got to find a way to deal with it, whether it's technically or by other means. And that's why I think that situation at Google is so important, because it represents something much larger that's going on here, and it's something that the press is going to have to look at, as well as all these companies. Okay, one more question. Why is it so easy to demo a thing that's evocative, and so hard to turn that into a complete product that we engage with every day? I think it's just about aligning the technology with the need. That OpenAI Rubik's Cube hand, that is not aligned with any need. We don't need that. The trick is finding where there's real gain and applying it, and I think that's where people often sort of miss the point. These neural networks have worked, and worked really well, in particular areas; they don't work well in other areas. There's all this hype around AI and sort of remaking how your business operates, that sort of thing, but that's something different. There's not always an alignment. There is an alignment with that DeepMind result. That is something that is a real need, and they're going after it, and in one sense they solved it. There's still a lot of work to be done, but that's what we're talking about. Protein folding. Protein folding, the CASP contest. That's something that the world needs, and they're going
after it. You know, GPT-3, it's not hard to be impressed by it, but it's really hard to see where that's going to have practical application. When you find where it works, it becomes much easier to show people. I think the difficulty is often just sort of a misalignment, if that makes sense. Yeah, no, that totally makes sense. All right, well, I think that's a good note to end on. Thank you so much, that was a lot of fun. Thanks for answering all my questions. Thank you, glad to do it, and really good talking to you as well. Yeah, real pleasure. Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",9145
+Dave Selinger — AI and the Next Generation of Security Systems,https://www.youtube.com/watch?v=dSL9ttDARe8,3368,2021-03-11,"We have this 7-Eleven that gets burglarized literally once a week. Guy walks up with a crowbar and he swings at the door, and that's the end of our video because our guards get on and say, ""Hey, jerk, get out of here. The police are on their way,"" and the guy walks away. Whereas if you had a dumb camera, you get this really cool video that for the next 45 seconds, you see this guy banging on a window. So what we've had to do is we've had to train the market that like, hey, prevention is possible. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. Dave Selinger is the co-founder and CEO of Deep Sentinel, an intelligent crime prediction and prevention system that stops crime before it happens using deep learning vision techniques. Prior to founding Deep Sentinel, Dave co-founded RichRelevance, an AI recommendation company. Super excited to talk to him. Maybe you could actually start by just describing what your company does and how it uses deep learning. Sure. So Deep Sentinel is a physical security company. So we have cameras and we actually protect facilities. It's not cybersecurity protecting computers. We are kind of a competitor to ADT, which I'm sure you've seen the lawn signs for, and the whole premise behind that business was, hey, ADT doesn't work. A lot of people don't know this, but police departments across the country basically don't respond to burglar alarms because they're 99% false alarms. So they just really don't help people protect their families. Maybe it makes you feel better for a little while, but it doesn't actually solve the problem. So I started looking at like what really wealthy people do, and they have people that sit in a guard shack and they watch cameras all day and they pay tens of thousands of dollars a month for that service because it works. The problem is obviously if you've got a bill for $25,000 a month, I would say there's a very small population that's going to write a check for that every single month.
So what we realized was that if we use AI to both make the guards more effective, more efficient, and able to do their job, and then we use all the technologies that are available now to drive that into unit economics, where instead of having one guard or 10 guards protecting one house, we have one guard protecting 10 homes, then protecting 100 homes and now protecting 500 homes and that's really what our business is all about. Got you. So the vision systems are deciding which cameras are interesting to look at and then a human actually verifies that there's something going on. Is that right? Yeah, and it's raw. That's exactly ... We use the vision systems, which is where deep neural networks have made a lot of progress to choose what to look at. Then we're getting increasingly sophisticated with that to not only choose which cameras to look at, but where to look in the cameras. What are the areas of interest? Hey, if something happened five seconds ago, here are the references that kind of wrap around this event so that you as a human being have the full context, even in the course of like half a second. So how much of that is interface for the operator and how much of it is like intelligence? Where do you sort of invest your effort? That's a great question. So, because nothing exists like this, we had to build the company in a vertically integrated fashion. So we actually build our own hardware. Behind me we have our hub, which we actually built ourselves and that runs the AI in the customer's home. So we've had to invest a lot of effort. Vertically integrated business is like by far the most complex way to do it. So we have a lot of investments that's gone into AI, we've got a lot of investment that's gone into operations, but really the thesis of the business is, hey, AI can change all of those things if you use AI correctly. This is an AI-driven business because of the nature of what we do, which is video oriented. We can make our operations team smarter. We can make our customer care team smarter. We can make our engineering team smarter by integrating that throughout. Interesting. So what's the part that runs in the customer's home? You actually do some processing before it even goes to you? Yeah, I'll tilt this to the side here. So right there is our hub, it's right here in my office. Down here are the next generation versions of the hub, which we're working on right now. This is both my home office and our R&D lab seeing as COVID has helped us really concentrate our real estate effort. So what we run in the home actually is almost everything. So we've really focused on moving everything to the edge and there's a lot of work that's being done right now on edge processing. From what we're seeing and what we're building on that curve is just astronomically improving year over year. What we currently do is we run a version of Linux in the home that has basically all of our BI stack. It's all encrypted and whatnot, but we run almost all of the business logic in the home so that the decisions can be made in real time with the camera, and as you might guess in a security context, the difference between real time and 500 millisecond latency and communication is everything. So we're able to do stuff at just very, very rapid speeds by doing that. So what exactly is the business decision? Is there like a possible issue? Is that right? Yeah. 
So if you think about, if you've got a Ring at your house or a Nest camera and you actually turn on alerts, which I highly recommend you don't do, you're going to get like 1,000 different alerts. So the first decision that the AI has to make is, is this alert worth reviewing in further detail and to what detail? Then the second thing it needs to determine is what's in the field of view and is that worthy of taking the time of a human guard to review this? Then the third decision is when the guard is reviewing this, what are all the other pieces of information I need to share with that guard to make sure they can make an informed decision in the course of a couple of seconds? So again, all of that runs in the home on that hub, and then by doing it in that local situation, all that back and forth communication happens in sub 10 millisecond timeframes versus 20 to 500 milliseconds back and forth to the cloud. Well, that's really interesting. Can you talk about your hardware? Sure. So the current version of hardware that we're running is a Qualcomm Snapdragon 820. So it's what you'd find in a Samsung Galaxy S6. So if you open up our hub and you peel back all the boards, you're going to see something that is exactly shaped like this at the bottom of the hub and there's a bunch of circuitry around it, but it's literally kind of the reference design for the Samsung Galaxy S6 sitting in the middle and then we put all of our electronics around the outside of that. That was quite an adventure for me because going into this, my experience with hardware had been the robotics that I built in a lab at Stanford when I was there. That was build it once, use it once and make it do everything that it can. Then a bunch of Raspberry Pis and the types of stuff that you build in your house. That's a robot that I built of a BBH using a bunch of Raspberry Pis and Arduinos and stuff like that. So you think you know about hardware. Just very, very briefly, I will summarize all of my learnings in the last four years. You don't know diddly squat. If that's what you know, you don't know anything about hardware. So we really had to take the time to learn about how do we design that and manufacture that with high quality and solve a bunch of the problems. That last mile of real human beings is really important. I feel like all of my friends that do hardware, I'm a little jealous, because it seems so cool and they absolutely hate it and they like to complain about it. So I don't hate it. I would say though that the amount of learning that I experienced versus the amount that I expected is a ratio of about 10 to one. Whereas, for most smart people, you go into something, you kind of get a sense of it and you're like, maybe I'm off by 40% or 50%, or maybe it's double. Hardware was definitely a 10 to one ratio of how much I had to learn in order to get into market and be productive. Interesting. I wanted to ask you actually- It's cool stuff though. It's amazing what's happening in hardware right now though, I will say. Sorry to interrupt you. What's going on? The Google Coral Board is really phenomenal. The work that Nvidia has been doing is really great. I think my favorite thing though, is just that it's not all about Nvidia. So on the training side, it's still really like an Nvidia dominated world and you can get into China where in China, I see a lot of new R&D happening outside of the Nvidia world, but a lot of that's not really readily available to us here. 
On the mobile and edge side of the world, you've got everything from, like Rockchip has what they're calling an NPU, that is a neural processing unit and it's an accelerator. You've got the Google Coral Board, which is driving quantization, which I think is super sweet. That makes things much simpler from a mathematical perspective, but way, way, way, way faster and way lower power. You've got Nvidia doing their Jetson series, but I think overall, what I would say is that in the training side of the world, Nvidia is here and everybody else's here. On the runtime side of the world, Nvidia is here and there are people that I think that are better, significantly better on a cost for performance, and overall performance basis. So we're seeing a lot of innovation happening there and XP, which is another chip manufacturer. They're launching their own NPU in Q1 of 2021. There's just a lot of promise to see a lot of competition, which I think is going to drive a lot of improvement. Actually, how do you evaluate, why did you decide to go with the Qualcomm board over something else? How do you think about it? So we chose the Qualcomm board, great question, in 2018, and there were three primary factors. The first one was, does it do what we need it to do, which by the way at the time there were a lot of things that would say, like we do AI and they would come with like a pre-trained inception model. If you changed anything on it, it would break. The early days of runtime AI were pretty limited in terms of the scope of what things could do. So that was number one. Number two, was it's a consumer price point. So we're selling direct to consumer. So while my hub is, that's a $250 cogs piece of equipment. That cost me 250 bucks. If you buy an Arlo or a Nest or any of that stuff, you know this. The end user price is lower than that. So I had to really make sure that it fit within the constraints of something that consumers could afford, and then the third thing was it needed to be size-wise and power-wise kind of containable. So we couldn't just ship people an i7 with an Nvidia GTX card in it. That wasn't going to work either. Cool. One question I really am dying to ask you. So, I thought of you as a very successful entrepreneur. I actually have this memory of watching you pitch the CTO of eBay many years ago and just thinking like, wow, your presentation is so much better than my presentation. Just wondering, where did you hire this marcomm team that's making this stuff? So that's kind of how I always thought about you. So I was really surprised when I was running into some OpenCV bugs and then I think I found you at forked. Did you really find that? That's so funny. Yeah. I was like, did you like forked OpenCV or something? I remember just thinking like, is this the same guy that I know? That's not possible. What is he doing? I just was wondering, did you stay technical the whole time or was there like a moment where you got back into this stuff or what's the story there? Well, first of all, thank you for the shameful comment. I'm like literally red in the face here. I appreciate the compliment. I love being technical. I have always had this office in my home, which now is where I live, obviously 100% of my time, but I've always kept this and my wife has always been super supportive of me and it looks like you have something kind of similar in your garage in all the videos I've seen from you of just stay tinkering, stay busy, keep my hands dirty a little bit. 
There was a period of about three, four years at my last company, which was an enterprise sales company where I was traveling so much. I couldn't be super technical, but I took about a year and a half gap and started getting really, really technical again. What I found, what you found actually is that there was a version of CUDA, which is the Nvidia library that wasn't compatible with OpenCV. I ended up forking it. I ended up fixing the bugs so that you could run AI and OpenCV at the same time because if you were trying to run AI with OpenCV, you couldn't do it for like a year. So I forked it, I patched it and I figured out how to make it so that you could run AI with OpenCV on Linux, on the most recent version of CUDA and I was blown away. I did it myself and I was getting like hundreds and hundreds and hundreds of downloads every single day. I was getting bugs submitted to me on the main OpenCV branch and I was like, all right, well, I guess that's cool. It's funny that you ran into it though and for like a year and a half, I was the primary maintainer of OpenCV for all AI researchers around the US. It was pretty sweet. That's so awesome. I remember looking at your patch and just thinking, man, this is like really deep. There was a bug in the compiled C code that made it incompatible in terms of the data structure and some of the libraries with the most recent version of CUDA. At that point, were you already working on this new company? Was that what was happening? Well, I wasn't really working on the company. I was working on AI. In fact, I was really fortunate that I got to work with the guys over at Lux Capital. I went to them and just said, ""Hey, I'm going to take a couple of years off. I want somebody to bounce ideas off of, as I explore deep learning, because I think it's real."" Within a month I had built ... I was using genetic learning, but using a deep learning algorithm and some of the new stuff coming out in terms of vision. So I used a vision feature set using some of the ImageNet competition vision libraries, and then running just a stupid genetic algorithm to play Mario World. I built the world's best Mario World playing system in like three weeks and I was like, whoa, here I am in my little garage and I can beat the entire game of Super Mario Brothers. This has to be something. I haven't coded in four years and here I am with the state-of-the-art. So I went and I worked with the guys over at Lux and I said, ""Hey, can I spend a year just coming in and out of your office and bounce ideas off of you as to what we can do with this technology,"" because I want to make sure that we're thinking about the specific application in a way that can build a business. That for me, what was most important is that it really had an impact on the world. Make a dent in the world around us, because as you said, I'm fortunate enough that I can kind of choose the business problem I want to solve and while making great returns for investors is absolutely a necessary requirement. I had the ability to take an extra 18 months and make sure that I also did something that when I look back and I talked to my kids about it, that I'm really proud of and that they are proud of and they can be a part of. So we ended up spending that time trying to figure out is there really a business model here? A lot of people said security is a bad industry. It's one that's dominated by ADT. 
It's got all these old players and people buy based on brand and they don't buy based on capability, and I was able to convince them and myself that if you could do a step function in terms of capability, just completely change the game, make something available, going from reactive to proactive, entirely based on the use of this new AI technology that we can change the way people buy. That's the entire thesis of Deep Sentinel is, screw ADT. It just doesn't matter. Whoever wants to buy that can buy it, but if I can say, ""Look, you can go from having the police show up less than 5% of the time, eight hours after your house was burglarized to preventing crimes 95% of the time, isn't that a different reason to buy?"" So that was when I was getting technical again, to make sure that I understood what the underlying technology was about, and if I can add one more thing to that answer, my favorite thing I did in that whole period was I started a journal club because I wasn't technical enough to understand the state-of-the-art at that point. I'd been so long since I'd been in AI, like 12 years, 13 years since I was at Stanford. So I started a journal club and just by offering free pizza, all of a sudden I had these 20 people showing up together and just saying, ""Hey, that's a topic I want to talk about too."" We ended up building this little kernel of about nine, 10 people that are still going today, five years later. Every single week we meet and we read papers together and we debate things. Because they're all smarter than I am, for literally the cost of one pizza every single week, I was able to get them to read this paper with me and then explain it to me. I just pretended that I knew what it was about, and then they corrected all of the underlying nature of it. So it really kind of reminded me of the power to just bring people together and learn from each other. Again, I think it's been five years and the same group of people, we still meet every single week. We met last night and we read a paper on Facebook's new object detection algorithm. That sounds so fun. Where did you find these people that are smarter than you? Dude, I literally just post it on Meetup. I made this thing called East Bay ML. I posted it on Meetup and I started broadcasting it. Actually I did a little bit of Facebook advertising too, like for people that had Python as an interest in their Facebook profile, but I only spent like 300 or $400 and I got hundreds of people to sign up for this meetup. We now we're at the point where every other week we do a paper and then in the alternate weeks we do code and we literally like load up Colab and code together as a group. These are now, because we're in COVID, about 20% of the people we've never even met them. We don't even know who these people are, but they just show up and we broadcast it. For me, it allows me to stay technical while I'm the CEO of this company without having to kind of distract or bug my engineers. That's so cool. What kinds of coding do you do? Mostly Python and Colab right now at this point. I would say I do a little tiny bit on CHIP. So I have like a Coral Board and Nvidia board and a bunch of like CPU GPUs in here. So I do a little bit of that, but most of my work now is really in Colab just to make sure I kind of stay fresh on where are we at with neural network architectures and how good do they really perform in reality. The improvements in ML over the last few years, does it affect your business? 
Is it important to use the very latest stuff to make your company work? It's not important to make it work. It's important to make it better, and in the words of somebody that I think is, again, way smarter than me, Kai-Fu Lee said that AI has moved into the age of implementation. It's no longer in the age of science, it's in the age of implementation. I would disagree with that a little bit. If you kind of look at the S curve of innovation, where you've kind of plateaued out in terms of the actual innovation, I think the neatest thing about AI is that we're at both ends of that curve simultaneously. We have implementations that are at massive scale, like Facebook doing facial recognition on all of your photos, and at the same time in the last year we have seen BERT, which from a language processing perspective is a step function better than all the NLP that came before it. We saw AlphaZero come out, which I think is just a phenomenal leap forward in terms of reinforcement learning. Then just in the last four months, we saw the new DETR paper come out of Facebook, which is an object detection algorithm that is also, I think it's like 25% better than the predecessor at mean average precision. Then Google had a similar one called EfficientDet. But to see 20% improvements still in the course of a year, that's pretty amazing, that we're both in implementation and in the research phases. So for us, what that means is that we have to continually be climbing that ladder, because we get operational leverage, we get margin and we get operational effectiveness by continually researching how each of those different new technologies affects our business. Got it. That makes sense. Has it been an adjustment to go from working on maybe not safety-critical applications to an application where you really can't afford to make mistakes, or do I even have that right? I would assume that in your world now the cost of an error is much higher than with a recommendation system or stuff you've worked on in the past. It is. I think though, at the same time, I took a card from Elon Musk's deck, which is if you're going to fail, fail early, fail fast, fail hard, and then get to the next plateau. Because my advantage in being a startup again, versus kind of the behemoth, giant, slow-moving idiots of the world, not that ADT are idiots, but let's just call them idiots for the purpose of this conversation, is that you only have really one advantage, and that is your willingness to just look the monster in the eye and fail fast and fail early. So yes, it is a mission critical product and we absolutely strive to be better than everyone else, but we also recognize that if we're going to make these mistakes and do this learning, we've got to do it in this phase of the business so that when we get to the hundreds of thousands of customers, we're at another plateau. I guess if ADT, if it's true, I wasn't aware of this, if ADT actually, when the alarm goes off, nothing typically happens, then maybe it's okay if occasionally your system has an alarm go off and nothing happens, is that right? Or that's probably not what you put in your marketing content. I would not say that it's all right if we don't succeed, but I will say that even in the really, really rare cases where we don't operate as quickly as we would like, we're still a hundred times better than the next best alternative.
We have customers where we've messed up and we have, and that happens but on average and in fact to the 99th percentile, they understand because they understand that when they went to market and said, ""Hey, who are Deep Sentinels' competitors?"" The Google result list is zero. There's literally nobody else that can do what we do at a consumer affordable price point, and nobody else that does it as effective and as at much scale as we do. So they are generally pretty understanding that like, I get that you're learning. I want you to protect my family. I rely on you. I can't put up with this, but I understand. Interesting. It's funny, I've tried to ask this question to a whole bunch of people and I think I've kind of stopped but I always wonder, with people working on autonomous vehicles or applications where there's really like, safety is a huge issue, it just seems like incredibly hard to know. I've looked at so many ROC curves in my life and pick where you go in the precision recall trade-off, but I'm so glad that I'm not in your shoes trying to pick the precision recall trade-off. Do you have anything to say on how you- That for me is a little bit easier. So on autonomous driving, you have to choose left or right. Faster, slow, which is actually a more hard question than ours, which is show this to a guard and allow the guard to review it. Because I have a trade-off. I can have more guards than I need, and I can show them more video and I can even show them trash videos for a period of time, because I can afford to do that. As long as I can maintain the fidelity of their operational behavior. I can actually just solve that problem with money. Whereas if you're choosing left or right, you got to choose one, you can't choose both. That's, I think a much harder decision that you really can't solve with money. So they have to go slower. I can go faster and just lose a little bit more money for a short period of time. Although you can't do that infinitely. So I feel like that logic would apply anywhere. So at some point you have to tolerate some amount of error. There's some amount of error you always have to tolerate with these systems. So how do you even think about that? There is. So the way that I think about it again, because our decisions, the core decision, the most dangerous decision that we make is binary, what you can do is you just perform it as an experiment. You operate at the lower threshold, but then you measure at a higher threshold and you just measure the gap here. Got it. Then you have a level of acceptable gap, and then the second thing that we do is we also measure that across the life cycle of an event. So because we're looking at events that are, let's say 25 seconds long, that's 500 frames. So I actually have 500 opportunities to make the decision to go wild with that event. So I don't have to make that decision at this point. Now, I can't tolerate it if it's 20 seconds late, but can I tolerate it if it's 300 milliseconds later? I generally can't. So what we've done is we've been able to drive this threshold where we keep that number pretty much zero, by the way, to be honest with you. Then we drive other dimensions of flexibility in our decision making, instead of driving the decision of I'm going to let this event go, because we don't consider that to be acceptable and we keep that number pretty much zero. Do you pass along your confidence value to the operator? Great question. 
No, because in general, what we have seen is that the confidence values, and you see this all the time, ""I'm 99% sure that my cat is a dog,"" like, it's not real information. I don't find it that useful, and the shape of those softmax curves is so sensitive, and the number that comes out is not a true confidence number in the sense that humans interpret them. Like 0.6, which would turn into 60%, really doesn't mean 0.6. In fact, if you look at the shape of most of the softmax curves, they're very heavily weighted on the ends, between zero and 5% and between 90 and 100%. There's almost nothing that's in that useful range in the middle. So we tend to just trim it off, because if you try to normalize it so that it's distributed evenly between zero and 100, what you're doing is you're taking basically 0.9 to 0.91 and making that this huge section between 50% and 75% confident, and we find that it's not actually that representative, like the granularity doesn't really exist in the confidence, at least in the curves that we've seen. Interesting. Do you ever go back and deploy new models to your customers, like existing customers? All the time. We deploy new models on a weekly, monthly basis. Interesting. Will you train them on the customer's data? Yeah. So we train them both on the customer's data in aggregate, and then we actually have a patent on personalized training at the edge, or a patent pending, where we use the individual customer's data to fine-tune the model in their home, and that's one of the neatest things about having the hardware in the home: we have a huge chunk of data in the home and we also have a model in the home. So there's a bunch of advancements that have been made in semi-supervised learning where you can refine these models at the edge. So we could do that both in terms of how we interpret the final coefficients coming out of that fully connected layer, as well as do that on actually retraining some of the weights in the model in the home. So it'll actually retrain on the edge and not phone home? Yeah, so we can do some amount of that. We have tended to do that only on the fully connected layer, or like maybe one or two layers, and we've done a bunch of tests to see where that matters. In general, we found that even just the fully connected layer, rebalancing the weights of the fully connected layer, is pretty effective at driving a massive shift in mean average precision, especially when it comes down to our problem very specifically, which is that we have stationary cameras. So a big part of stationary cameras in terms of a vision problem is identifying the background. So when you have hundreds and thousands and millions of images of what a camera looks like on average, you can develop a very clean sense of what the features are of the background, so that your background subtraction becomes even more effective across the board. Do you have any examples of customers that have kind of taken a camera and deployed it in some way that you didn't expect that caused the system to struggle? We have tons of examples where the interface with the wet part of the world, the human being part of the world, is wackadoo. We have to have a very specific policy about what you do when you see naked people, for example. So we have customers that we have discovered that ... And you know him as well.
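As a rough illustration of the kind of edge fine-tuning described above, freezing a pretrained backbone and retraining only the final fully connected layer on a single camera's own footage, here is a minimal sketch. It assumes PyTorch and a torchvision ResNet backbone; the model choice, the two-class person-versus-background setup, and the hyperparameters are placeholders, not Deep Sentinel's actual pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a pretrained backbone and freeze every weight in it.
model = models.resnet18(weights='IMAGENET1K_V1')
for param in model.parameters():
    param.requires_grad = False

# Swap in a fresh final fully connected layer; only its weights will train.
model.fc = nn.Linear(model.fc.in_features, 2)  # e.g. person vs. background (illustrative)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # One update on a batch of frames from a single camera.
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the last layer's parameters are handed to the optimizer, each update is cheap enough to run on a small edge device, while the frozen features stay identical to the fleet-wide model.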
So Sean Parker used to always say, ""The world is broken up into two categories of people, exhibitionists and voyeurs."" I'll tell you a product that has people watching your cameras, you definitely find people in that first category. So we have customers that will put their camera in their house and then proceed to treat it as if it were an adult rated YouTube channel. I'd have to ask, and maybe we should take this out, but ... What do you do? We turn the camera off and we apologize to the guard that had to watch it. So we have special features where that escalates immediately. We disable it. We send them a very nice letter that says, ""We noticed that you've installed your camera indoors. We have gone ahead and disabled that. Please verify when you've moved that outdoors and let us know,"" but it's a product that touches the real world. This isn't a Nest or an Arlo where it's a dumb device just recording. This is a device that is a live ecosystem. So we find all kinds of stuff, and this is one of the things I really love about the business that we've started is that we have this huge, deep technical investment in AI. Then we have all of these different real scenarios that it's being trained against that when you compare our AI against what you would get from a horizontal vendor that does image recognition and classification, there's no way that they've spent the time to say like, ""Hey, what does it look like when you're inside and there's a guy dancing in his underwear in front of your camera?"" Nest doesn't have that, really. They don't do that. So we've just solved a lot of these really neat technical problems that are at this interface between cameras and human beings that I think is really interesting. I guess this is outside of AI maybe, but I'm kind of curious as another entrepreneur to entrepreneur, it seems like you have this interesting challenge of proving to your consumers that your camera's actually better. I feel like a lot of things it's like, obviously better, but I guess in security, it's better is the absence of issues. how do you demonstrate that your camera's better? It's more of a marketing question in some senses. So one of the challenges that we have is that for the last 30 years, the definition of better for cameras is pixels per inch, or color resolution where what we did is we changed the game. The camera really is a dumb camera. Like our camera is great and it's as good as anybody else's, but it's not better as a camera. In fact, there are areas where the new version of Nest, new version of Ring or whatever are technically better than our camera, but none of them do what our camera does. It's more about the capability. So what we found is that we break that problem down into two challenges. We have to handle the matrix of like, are you better and are you worse? Then the second piece of it's really what you said, which is how do you deliver peace of mind, which is such a dangerous word from a marketing perspective. So what we've really focused on, we have the series of videos called the stopped videos, that show us just stopping crimes. Boom, repeatedly over and over. They're kind of individually boring because the point where the guy walks up to the front of a 7-Eleven, we have this 7-Eleven that gets burglarized literally once a week. Guy walks up with a crowbar and he swings at the door and that's the end of our video, because our guards get on and say, ""Hey, jerk, get out of here. The police are on their way,"" and the guy walks away. 
Whereas if you had a dumb camera, you get this really cool video that for the next 45 seconds, you see this guy banging on a window. So what we've had to do is we've had to train the market that like, Hey, prevention is possible. It's much harder to sell proactive in some senses. So we've really spiced it up and made it exciting and made these video series about it. That leads to the question, ""Hey, how do you guys do that?"" We now have an AI, an autonomous AI called AI Deterrent that triggers within 200 milliseconds of the AI detecting somebody suspicious at night. That's pretty sweet too, because we're literally intervening in these crimes even before it would get to a guard and that typically buys the guard another three to five seconds where the person's talking to the computer and then the guards like, ""No, man, I really need you to go."" Do you worry about adversarial attacks? Like someone figuring out your algorithm and then finding ways to go in undetected? I would think you might be too smart for that to really be an issue, but maybe that is. I think, yes, we think about it. I think about that all the time, because again, my business is security. So that's the way that you have to view the world when you're insecurity but from a business perspective, at the point that we have people really developing adversarial attacks, we're at another stage in our development. Got it. Well, another totally different thread that I wanted to ask you about, because people ask me about this all the time. You're a fairly active angel investor and long time, super successful entrepreneur. I was kind of wondering how you think about investing in AI companies. Like, what you look for there. So I have actually made the decision about four or five years ago to stop angel investing for the most part because I did okay on some of them, but I found that the people that are really good at angel investing do that full time and that I didn't want to do it full time. I find it really neat to meet with entrepreneurs and hear their story and like you. I like helping, I like being there for them and the investment piece was, in some senses, clouding that interaction just because I don't have enough money to invest in every one of the cool entrepreneurs that I see, and at the same time, I don't have enough time to spend time with a bunch of them. So what I ended up doing was I ended up joining a venture firm as a venture partner so that they could deal with the investment side of things and I could just really focus on the advising and talking to them. What I've generally seen are kind of two camps of AI companies, and I think this is what you always see with emerging technologies is you see the geeky technology oriented founder who really doesn't understand the business that they're in and they're just like super smart and they're endearing. You want to help them, and then you see the smarmy, sorry for all the guys out there at girls out there that are this, but like the sales person. That shows up and they're like, ""And the AI is just amazing. You would not believe it's using this thing called stochastic gradient dissent."" You're just like, ""Oh my God."" So after I pulled a knife out of my eye, I don't advise those companies, but I entirely focus on those founders that have that chutzpah. They've taken the risk to do something that they are not good at. 
I am going to go and start a business as a tech person, and I've got this crazy great insight, and I find that to be much more compelling and interesting and fun for me versus trying to educate some sales schmoe on what's really happening under the covers. I guess you've watched a lot of technical founders become wildly successful. You've been doing this a long time and have had really front row seats to Silicon Valley. We should put your bio in the show notes, but it's impressive. Are there any patterns that you've noticed over the years in who you meet in that mode as a technical founder? Because I'm sure that's a lot of the people watching this and listening to this: how they actually succeed, which people are likely to succeed, and then what they do to make themselves successful? So the two things that I think are pretty consistent are: embrace crazy, consistently. The things that other people aren't doing, that's exactly the world that you have to live in. You can't do all of crazy, but if you live in just not crazy, then you're competing with them on their terms, and that is consistently a path to failure. So you have to embrace crazy. Then number two is, the technical founders that I have seen get wildly successful, they aren't necessarily really self-aware, but they're sufficiently self-aware to hire their complement. That's either one really amazing person, or two or three, or a whole team, but they find some way to remain the crazy person and have a team that embraces that and supports them and gets the other stuff done. So I want to talk about social responsibility, because I've seen just this marked shift in the role of data in our society over the last six months, and not necessarily in a positive way or a negative way, just a standout observation. If you look at the two or three big things that have happened to the United States and the world in the last six months, COVID obviously, and Black Lives Matter, both of those have been really hard for our society to get a hold of, because we have built a society that is headline based, whether you blame it on Twitter or whatever, I don't care. I'm not going to blame anybody. It's just that we are a headline-based society, and the problems of COVID are so incredibly complex. They are deep statistics, they're so data oriented, and they don't lend themselves to a quick 40-word summary as to what's going on. The same thing with Black Lives Matter. You could find a black person that makes more than a white person doing the same job. Absolutely, and if I wanted to, I could write that article up, ship it off probably to Fox News and they would run it. I could also find a black man that makes $100 per hour less than a white person at the same job, ship that off to MSNBC and they would run with it. Those instantiation-based, existence-based proofs are exactly the opposite of what statistical distributions are all about. A statistical distribution is about capturing the totality of a problem, and COVID, I think, brought the concept of an exponential curve to the masses. I think 16% of Americans get through calculus and statistics in high school, but now 100% of Americans have been exposed to the concept of flattening the curve and why an exponential curve is important. Then we got exposed, unintentionally, to what happens when you intervene in an exponential curve? Oh well, but that never was going to happen.
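To make the point about intervening in an exponential curve concrete, here is a minimal, purely illustrative sketch; the starting count and the growth rates below are invented for illustration and are not taken from any real dataset:

```python
# Illustrative only: compare an uncontrolled exponential curve with one where an
# intervention on day 30 lowers the assumed daily growth rate (flattening the curve).
initial_cases = 100              # assumed starting case count
daily_growth = 1.25              # assumed 25% growth per day before intervention
post_intervention_growth = 1.05  # assumed 5% growth per day after intervention
intervention_day = 30
days = 60

no_intervention = [float(initial_cases)]
with_intervention = [float(initial_cases)]
for day in range(1, days + 1):
    no_intervention.append(no_intervention[-1] * daily_growth)
    rate = daily_growth if day < intervention_day else post_intervention_growth
    with_intervention.append(with_intervention[-1] * rate)

print(f'day {days} without intervention: {no_intervention[-1]:,.0f} cases')
print(f'day {days} with intervention:    {with_intervention[-1]:,.0f} cases')
```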
Well, now we have intervention-based statistics that, I'd say, less than 16% of Americans are aware of, but statistics have moved from being a fringe thing that affected insurance and financial markets to being something that impacts all of our lives. I think that's something that is both exciting to me and scary, because we, as data scientists, have generally lived in this world of, I'm just living in the numbers. So whatever I say is just the numbers and I have no social responsibility, and I think that's absolutely wrong, that that is categorically an incorrect statement. In fact, I think the opposite is true. It's that numbers have become so important in our society that it's our job to go to primary sources. It's our job to educate our friends. It's our job to read past the headlines and play an active role in helping people understand what a distribution means for those things. I had a couple of examples, for me, that really stood out. At the beginning of COVID, there's a great blog called Towards Data Science. I don't know if you read that one, but I like it. I think it's pretty good. Totally. Terrific. There was an article written in Towards Data Science that said the cancellation of these conferences is absolutely the wrong thing to do. In fact, I'll just share it for those of you that are watching. It's this article here. He invented this term called Coronoia, like paranoia. Coronoia, and no slight to the author, because I think we were all learning. So I don't want to slight this author, but I hated the article. The reason I hated the article was because of the argument the article makes: here's the population of Spain, it's 46.66 million people; the total cases of coronavirus in Spain is two; so therefore the number of people that would get coronavirus at this conference is 0.0046. What I really hated about this was that we had an opportunity to have an educational moment and have a conversation, and instead, what we did is we chose a false population. The population of Spain and the number of cases in Spain is not the problem. We simplified the problem in a way that abstracted out the exponential growth component and the exposure component, and we stopped and didn't believe in intervention. We didn't analyze the actual data, and this is the danger of statistics to me: you can use statistics to justify something. You trick people by choosing the wrong population, you trick people by choosing the wrong coefficients, you trick people by using the wrong underlying model. In this case, he used a linear model instead of an exponential exposure model. That, to me, is the quintessence of what we as data scientists must tackle: to be ... sure, we can be opinionated, but to be balanced, and I think it's just so much more important than it ever has been, because we, as a population, don't have the educational system such that 100% of Americans understand what statistics mean. That reminds me of, I think, another observation that a lot of folks have made about machine learning algorithms: when they're based on training data that could have underlying bias in it, we can easily end up with models that reinforce society's biases and make them worse, and actually, you make a device that calls the police. So how do you think about that in your device, in your training data and what you do? Dude, such an important question, and I just made this big statement that we have to be proactive.
We have to recognize that just saying, ""I'm agnostic because I live in the data, so therefore I'm not biased,"" is a falsehood. Mark Zuckerberg said that in front of Congress, and I loved that he did that because it shone a bright light on the fact that that is a lie. I don't know that Mark did that intentionally, because I think we all believed that right up until very recently. I don't think we were exposed to our own biases and our own potential for bias until that happened. What's interesting about that moment is I happened to be in Washington DC on that exact same day, and I had called a meeting with the ACLU and the NAACP and a number of other civil rights organizations. I presented my business to them and they all said the same thing. I said, ""No, no, no, no, no, no, we can't be biased because we're based on the data."" What it did is it really forced me to take a step back and say, ""Actually, could I?"" Let me ask the question. Instead of making the statement that because I'm focused on the data, I can't be biased, let me make the hypothesis. Let me make the null hypothesis that states, because I'm based in data, I can't be biased, and then disprove it. I found that there were just hundreds and hundreds of ways to disprove it. So we actually designed our system specifically to be race blind. We intentionally designed it to be race blind. We manage the training data coming in so that we can identify people. Then the second thing we did was we used the data in a way that cannot be abused from a race perspective and an ethnicity perspective. Then the third thing we did, which I did not expect to do. In fact, I came into that meeting with the NAACP saying, ""We're not going to do this."" I said, ""We're going to not track the race of the people that we call the police on."" They said, ""In fact, I want you to. I want you to do that, but I want you to store it in a system that's not being used for machine learning. I want you to store it in a system that you use for auditing your employees and you use it for auditing your own business to make sure that you are doing the right things."" So we did those three things, and I feel wildly good about what we did, because one, we were proactive. We did it before we got in trouble. We did it before we called the police on the wrong person. Two, it was an educational experience for me. Again, I was very strongly in the camp. I was in a recommendation system business. I built the recommender at Amazon. Those systems are entirely based on, hey, you just use the data and whatever the data say, that's what you optimize to. I was surprised and pleasantly pleased with the results of taking that step back and saying, ""Let's treat that instead of as a conclusion, as a null hypothesis,"" and I learned a lot. It's a strong claim though, that your system can't use race. How do you know that's the case? So it's not that it can't use race. It's that we make sure that the distribution coming in on the training side is designed to be race agnostic. It's a fair distribution on the input side, which requires tweaking the distribution on the open side. Then the second thing is we don't allow the system to perform specific activities. We don't do facial recognition at this point because we recognize there are some issues with facial recognition, and we don't allow it to call the police. We focus on classifying something that is, itself, race agnostic. So we focus on behavior and we focus on classifying people versus cars.
We do not focus on classifying a suspicious-looking person, and we've specifically designed the system to not be able to do that, if that makes sense. So the system looks for people, not suspicious people. That's right. Then if it sees a person- Then it identifies the behavior, and it separates the behavior identification from the person identification. This is what I mean by actually designing it to not be able to do that. We do not allow the behavior identification portion, the suspicious identification portion, to know what the person looks like. It does not have access to the pixels. It only has access to a completely removed representation of the behavior. So it's a set of vectors and features that have completely ... have entirely removed all the pixels. So it synthesizes the pixels into some information that then- That's right. So we created an intermediary data structure, which is: here is this object, here is its classification, and here are the dimensions of its motion, and then we drew a hard line. That system does not talk to the other system that says, based on these motion vectors and this description of behavior, this is suspicious or not suspicious. Interesting. Again, it was kind of non-intuitive to do that, because as a data scientist, I would say, I just want to go from here to here. That's going to be much more accurate, and from a mean average precision perspective, it might be, but it also is exposed, in a real-world context, to having bias end to end. But by enforcing this intermediary that gets rid of the pixels that might contain anything having to do with race, it makes sure that bias can't propagate through. Then you collect data that tells you that it's not using race, you're saying, or how do- So what we do is we then collect data in an independent system that says, these guards were calling the police on only Latina people, or only on Asian people, or only on black people, and we make sure that we are auditing all chunks of the system against the racial distribution, so that they're acting in a way that is consistent. Are there any other groups that you worry about? Do you look at gender too, or other aspects of appearance or anything like that? We do. So we track the race, the gender, and then generally the age as well, because age is a really important one. So one of the things that I learned in my meeting with the NAACP is that black males under 18 are frequently classified, both by police and by witnesses, as being adults, and you need to treat minors differently. We have a social responsibility to say, ""Hey, man, I see you TP-ing that house. Get out of here."" We call the homeowners and say, ""Hey, there's somebody TP-ing your house,"" instead of, ""Hey asshole, I'm calling the police. You need to stop right now."" That creates, as we're seeing, the intervention of escalation creates a different outcome. The intervention creates the outcome. So you better darn well choose that intervention based on real data, and if, at the end of the day, black males are frequently classified by people as being older, then we've got to train for that, we've got to compensate for that. We've got to treat minors as minors, because their brain development is the same regardless of race. We have to enable them to develop their brains. Have you written about this at all? Is there some place we could point people that wanted to learn more about this? I'm just not sure we can cover all the questions that- I haven't. I haven't written about it as much as I probably should have.
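To illustrate the kind of hard line being described, here is a loose sketch of the separation; every name and threshold in it is hypothetical and invented for illustration, and it is not the company's actual system:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrackedObject:
    # Hypothetical pixel-free intermediary record: the behavior model only ever sees this.
    object_class: str          # e.g. 'person' or 'car', never an identity or demographic
    speed: float               # coarse motion feature derived from the track
    heading_changes: int       # another appearance-free behavior feature
    seconds_on_property: float

def detect_and_track(frame) -> List[TrackedObject]:
    # Hypothetical detector/tracker: consumes pixels, emits only the pixel-free record.
    # A canned value is returned so the sketch runs end to end without a real camera.
    return [TrackedObject('person', speed=0.2, heading_changes=6, seconds_on_property=95.0)]

def classify_behavior(obj: TrackedObject) -> str:
    # Hypothetical behavior model: appearance (and anything correlated with race)
    # cannot reach this decision because pixels never cross the boundary.
    if obj.object_class == 'person' and obj.seconds_on_property > 60 and obj.speed < 0.5:
        return 'loitering'
    return 'passing by'

if __name__ == '__main__':
    for tracked in detect_and_track(frame=None):   # frame=None stands in for a camera image
        print(classify_behavior(tracked))          # escalation and audit logging would hang off this
```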
I actually just did that. I had that meeting with those groups and I just did that for the purposes of ourselves. I probably should, at some point, write about why I did that and what it means and what the social impact of that is. I did write a little letter to our customer base when the Black Lives Matter movement was really taking hold, and I said, ""Look, I want you to know what we've done, because I'm really proud of it and I don't want racism to exist in our business, because it's so dangerous."" Because we live at this intersection between human beings and the police. We have a real responsibility to take that at a high level. I'd say 99% of our customers were like, ""That's awesome that you've done that."" You don't have to pick a political side to say, ""You've done the socially and business responsible thing."" Interestingly enough, we did get two or three customers that came back and were like, ""F you, you're making this up."" Wow. Which was surprising to me. So the hostility was like, you're making this up, or we don't like that you're doing this? You're making up that you did this, you're making up that there's a real problem. There is no problem. Ben Shapiro told me that there's no problem. Right, right. Got it. At the end of the day, I very much strongly believe in the First Amendment and I love that people have their opinions. Again, this is where statistics come into play. There are no real statistics that say that our country's not racially biased. Like zero. Well, let me ask you one final question, maybe a little less intense than that one. In your process of going from a model to a deployed system, what were the surprising bottlenecks? I think everybody senses that it's hard, but what were the points where this took a lot longer, or was a lot harder, than you were imagining it would be? So number one, first and foremost, is specific operands on specific chipsets. So for example, Qualcomm implements this huge stack of operands and it implements them in a way that you can accelerate. So as long as you stay in that little pool, you're okay, but all the new architectures typically are using new operands. That's a big piece of the new papers: they're getting to state-of-the-art by using this new version of ReLU, or this new version of a generalization operand between layers, or other activation functions that are non-ReLU based. What we have found is that in the last three years, in general, that problem is being solved with a big baseball bat. Instead of being solved with a surgical tool and being precise, it's being solved with a broad swing. So for example, Qualcomm has the problem that if an operand isn't supported, it just crashes. That's sweet, good to know. Then you've got TFLite, which I think is a much more robust architecture and I like TFLite a lot, but what does TFLite do? Well, if you're using an operand that isn't supported on your particular architecture, it just moves everything from that point forward onto your CPU. So you have a model and it runs in 25 milliseconds, and then you make a tiny tweak and now it takes 750 milliseconds. It's not this kind of nice smooth curve where things get reconnected in the middle. It just blows up, and as you might guess, systems that are based on an assumption of performance between 25 and 50 milliseconds do not perform well when performance goes to 750 milliseconds.
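One way to catch that kind of latency cliff is simply to time the interpreter after every model change; here is a minimal sketch using the standard TensorFlow Lite Python API, where the model path is a placeholder and a float input is assumed:

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder path: swap in whichever converted model you want to profile.
interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()

input_detail = interpreter.get_input_details()[0]
dummy_input = np.random.random_sample(input_detail['shape']).astype(np.float32)  # assumes a float input

# Warm up once, then time repeated invocations to see the per-inference latency.
interpreter.set_tensor(input_detail['index'], dummy_input)
interpreter.invoke()

runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_detail['index'], dummy_input)
    interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) / runs * 1000
print(f'average latency: {elapsed_ms:.1f} ms')
```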
So I think we've improved, in that TFLite doesn't crash and totally die, but actually at the end of the day, your system crashes and totally dies. So you get kind of the same outcome without having a system error, which is maybe better and moving in the right direction, but certainly not there. It's definitely still a lot harder than I would like. I would say that the second thing that I've learned is how incredibly distinct the world of training is from the world of runtime operations. If you're running it in the cloud and you're willing to pay the GPU prices on the cloud providers, then it's not that big a deal. But for any of the more precise architectures, the level of OS tuning that you have to do, the level of firmware and driver management that you have to do, it's a lot more than I thought it was. Again, coming from the Raspberry Pi tinkering part of the world to actually implementing it and having it run 1,000 times an hour, every single hour, 24 hours a day, 365 days a year, in all 50 states, that last mile was much, much more complex than I expected. Interesting. Cool. Well, thanks so much for your time. I really appreciate it. It was great to talk to you. It was great to catch up with you. I wish I could see you face to face.",10286
+Tim & Heinrich — Democratizing Reinforcement Learning Research,https://www.youtube.com/watch?v=oYSNXTkeCtw,3249,2021-03-04,"What we see right now in the field is that there's lots of interesting reinforcement learning results that come out of industry labs that have a lot of computational resources. And that makes it basically impossible for anyone outside, specifically in academia, to reproduce these results. And that was exactly the kind of motivation behind that environment, in that it's really complex, but at the same time should be affordable for grad students and master's students and whatnot to actually do experiments. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Tim is a research scientist at Facebook AI Research, and a lecturer at University College London. Heinrich is a research engineer at Facebook AI Research, and previously worked at DeepMind. Together, they built the NetHack Learning Environment, which is a super exciting project to make it easier for people to build and experiment with reinforcement learning algorithms. It also operates in a game called NetHack that I've played for the last three decades, and so I'm especially excited to talk to these guys. I had been thinking for a while. I was wondering how well reinforcement learning would work on the game NetHack, and then I came across your project, and you're actually making an environment where people could try different algorithms in NetHack. So maybe you could start by kind of telling me how you came to this idea, as a learning environment, and maybe describe what the game NetHack is. NetHack is this really old game that grew out of an even older game in I think the '80s or thereabouts. It's as old as Unix, basically. And it's this text-based game, Dungeons & Dragons style. The objective is to go down a dungeon and retrieve a certain item and then go back up and win, and that kind of undersells it, because the actual fun is in interacting with all the monsters, or picking up objects, and then there's lots of in-game jokes and it's also generally quite a hard game.
So I've been playing NetHack since I was quite young, I think about 12, when it was on a DOS box where someone installed NetHack and I didn't really understand what to do. And then later on, when I had the internet and there were some so-called spoilers, and I could look at the actual source code of NetHack, I started being able to do more in the game. It's really easy to die in NetHack, you don't have a second chance. You can save the game, but then it exits, and when you go back in it picks up where you were. And it's a really hard and really fun game with a still active community, and it's text-based, but still pretty complex. I feel like what's notable to me about NetHack is, I've probably played it more than maybe any other game, and yet I'm still kind of surprised by things that happen in it. I still find myself looking up what's going on. It seems people will even come to interesting ideas about how to use the objects in the game, and the game will actually have kind of supported these sort of one-in-a-million chance occurrences. So it seems incredibly deep; I don't even actually know how deep it goes, given how simple it looks at first. I fully agree. So the reason why we believe this is an interesting challenge for reinforcement learning is exactly that kind of depth. As Heinrich mentioned, from the looks of it, it's a quite simple game, in that it's terminal-based; so everything is rendered as these ASCII characters in a terminal. But in fact, it's so deep in terms of the number of items and the number of monsters that you have to learn to adapt to, there's always new things to discover. And on top of that it's procedurally generated, which means every time you enter the game, every time you enter the dungeon, it will be generated in front of you and it will look different from any other episode that you have been playing before. So that gives it also a lot of, I guess, replay-ability. And it's much closer, I guess, in spirit to more modern games like Minecraft, where also every time you play Minecraft, the world is generated. And that poses very unique challenges to reinforcement learning, because so far, well, for a long time we've been mostly using games like Atari games to test the limits of reinforcement learning agents. And that has been going on for a while and it has been good. But at some point I think people started to realize that in Atari, when you, let's say, play Breakout or you play even Montezuma's Revenge, which is one of the hardest games in the Arcade Learning Environment, every time you play that game it's the same. I mean, you can basically memorize sequences of actions through the game that lead you to win the game. And that's exactly what approaches like Go-Explore by Uber AI have exploited to win the game. So I think it started roughly two, three years ago, when people started to look into these procedurally generated games. I mean, Minecraft is one example, but it's very expensive to render and expensive to simulate. But also, I guess the Obstacle Tower Challenge by Unity AI is another example of such a procedurally generated environment for reinforcement learning, and then more recently, OpenAI's Procgen Benchmark is another example. So people are looking more and more for test beds where reinforcement learning agents really have to learn to generalize to novel situations, novel observations.
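For readers who want to poke at this themselves, here is a minimal random-agent loop; it assumes the open-source nle package, which registers NetHack tasks such as NetHackScore-v0 with Gym, and the classic four-value Gym step API:

```python
import gym
import nle  # noqa: F401  (importing nle registers the NetHack Gym tasks)

env = gym.make('NetHackScore-v0')   # assumed task id from the NetHack Learning Environment
obs = env.reset()

total_reward, done, steps = 0.0, False, 0
while not done and steps < 1000:
    action = env.action_space.sample()            # random policy, just to exercise the environment
    obs, reward, done, info = env.step(action)    # classic four-value Gym step API
    total_reward += reward
    steps += 1

print(f'episode ended after {steps} steps with score-based return {total_reward}')
env.close()
```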
And we believe NetHack is a perfect example of that, because it's at the same time also really fast to run and really deep, much deeper than many of the 3D games that you could play right now. I guess I had never thought about this, but I'm also a huge fan of the game Go. And that game also feels deep, but its depth seems to come from a lot of interactions with a small number of rules; whereas I imagine the NetHack code base just having this massive nest of case statements, it's almost like the complexity is intrinsic to it. I mean, both Go and NetHack, I think, are kind of deep in the sense that it's kind of hard for people to do well. But what is it about reinforcement algorithms that worked really well for Go and struggled to do basic things in the NetHack world? So Go is a really interesting case, because the depth and the complexity of Go comes from the fact that you're playing against another player. So we should, first of all, state that NetHack is a single player game; you play against the game, you're not playing against another human. And obviously, if you have a very strong human you play against, then that's a really hard game. But what makes this work for reinforcement learning is the fact that Go, as you mentioned, has very simple rules. So it's very clear, for a specific action, what the next state will look like, and that allows you to basically exploit planning mechanisms; they allow you to basically plan ahead and think through what happens if somebody plays a specific move and then I play a specific move, what will happen. And it's still really hard, because there's this humongous observation space in Go already, because you have this 19 by 19 board and then on every tile there could be a white stone, a black stone, or no stone. But in NetHack, it's fundamentally different, in that the transition dynamics that govern how a state evolves from time step T to the next one are extremely complex. First of all, it's partially observable. You don't see what's on the entire map; there might be a monster around the corner, but it's not visible to you. On top of that, it's stochastic. So every time you hit a monster, just to give an example, there's a dice roll in the back that determines really how much damage you incur. And on top of that, there's so many possibilities in terms of what could actually be on the tiles. So there's hundreds, as I mentioned, hundreds of monsters, hundreds of items, each of them coming with all kinds of specific attributes and specific mechanisms that you have to learn about in order to do well. So it's really, really hard to plan ahead. It's also really hard, all the time, to even learn about all of these mechanisms; whereas in Go you can write down the rules easily in a program and you can simulate it. I think there's another aspect of this comparison with Go and with MCTS-like algorithms. The NetHack community actually has done lots of crazy things outside of research and published papers; there are a few people in the NetHack community there, for instance. There's this alt.org website where you have officially recorded games. And what you could do in the previous version of NetHack is that you would have your local NetHack on your own machine. And you would try out a few things, and whatever you liked best you would do that in the actual online running game, where you basically have this perfect simulator, which is NetHack itself. Tim was saying, how could that work? It's not deterministic, it's stochastic.
So the way people did that is that they had a map from all starting positions of the game, with your inventory and so on, to the seed of the RNG, pre-computed this with a few days and hours of compute; and then you could look up the seed: see a new game, look at your inventory, and that's enough entropy to tell you what seed you are in, and then you know with which seed to initialize your local version, and then you can actually beat the game in no time, because you have the perfect simulator. And then the NetHack DevTeam produced a new version of NetHack that makes this impossible, where you can no longer manipulate the RNG state by walking against walls or whatever, the way that these people did it. But it's comparable in a way to how you would do it if you were just playing MCTS NetHack; you save the game and you try out what's happening and then you go back to the position where you really were. And you could probably beat NetHack that way pretty easily, but you'd really only beat NetHack, you wouldn't learn anything about reinforcement learning at large. And it's also really clear that for the community, and also for us, that would be considered cheating. I mean, really you should be developing agents that can, given a fresh game of NetHack, solve the game. That's funny. I think I would be impressed... Yes, if you could see the random number generator and forecast ahead it would be much easier, but it still seems a little bit tricky. I feel like there's a fair amount of long range planning that you need to do. I've actually never won the game, so I don't even know. But I feel like even if I could see ahead, it might be hard for me to beat the game. It's still going to be super hard for a learning-from-scratch reinforcement learning algorithm. But what these guys did is that they... you basically can get infinite wishes; that's the thing in NetHack. In certain situations, there's a wish and then you can wish for any object and you can get it. And if you can force the RNG to always give you a wish, you can get infinite amounts of wishes and you can always make this mini... When I played NetHack when I was very young, I did this thing called save scumming, which you're not supposed to do; where you save the game and then you copy the saved file, and then when you die, you go back and you go back to that point in time. And what you do with that, from a scientific perspective, is you force a really unlikely trajectory. All the games where you died and you didn't like it, you threw them out, and you go into this more and more unlikely space, and at some point you really dodged all the bullets, but the game will just kill you a thousand times per round because you didn't repeat it. And I think this is what's likely to happen when you can force your RNG to be in a specific state, you produce these extremely unlikely trajectories of the game. When you take the sort of basic reinforcement algorithm from Go, or just sort of like a vanilla reinforcement learning algorithm, and then you train it on NetHack, what happens? What does the character do? That's exactly the thing that we wanted to see. I mean, first of all, you couldn't use MCTS from Go just because you don't have that environment transition model, you don't know what happens at the next time step given the current time step and an action- Actually, wait. Sorry. I need to step back one step further. What are you actually even optimizing for? I mean, in Go it's so clear that you're trying to win, but I don't think that makes sense here.
That's a great question. It's a really great question. So ideally, we want to have agents that can win NetHack. And the way to win NetHack is to ascend to demigod by offering the Amulet of Yendor to your in-game deity. But the problem is that that's a really sparse reward, right? Yeah. It's like you have to solve it before you get any reward, so that doesn't work. Then there's lots of techniques right now for providing agents with intrinsic motivation. I mean, that's what basically keeps people like you and me playing NetHack, although we haven't won NetHack yet; we're just curious about finding new quirks and new interesting situations in NetHack. But what we basically did is we have a reinforcement agent that is trying to optimize for in-game score, and that comes with all kinds of caveats actually, because you can try to maximize the in-game score by doing all kinds of things that are unrelated to actually winning the game. So for instance, you get score for killing monsters, you get score for descending deeper down into the dungeon, but that really doesn't help you to understand that at some point you have to go back up again, just to give an example. Also, when people are really good, meaning when they already know how to play NetHack really well and they've solved NetHack, they start to give themselves all kinds of interesting challenges; and one is actually to solve NetHack while minimizing the score. So you can also do that. So it's not really a very good reward function, in a sense, towards the goal of solving NetHack. But I think it's still a really good proxy for now in order to compare how well different models or different agents do. So I think for now we're happy with that kind of setup, because we are still in a very early stage, or the community I guess as a whole is in a very early stage, when it comes to making progress on NetHack. But I think eventually, at some point, we'll have to refine that a bit, and the winning condition is actually winning the game. Got it. So you're optimizing for score? You can also optimize for gold or dungeon depth, these kinds of things, but typically you do try to optimize for score. Okay. So what happens when you put a vanilla agent in there? So what happens is quite interesting. So first of all, we thought when we started this project that a vanilla agent wouldn't really be doing anything in NetHack, it's just so complicated. Just learning to navigate from the first dungeon level to the next dungeon level is already hard, because there are all kinds of situations where you are in a room where there might not be any doors and you have to walk around the walls to find a secret door, which is actually quite tricky to learn. Then you might find doors, but they might be locked and you don't have any key around, so you have to actually kick in the door to even make it to the next dungeon. And we thought this is really hard for reinforcement agents to learn, because there's no reward attached to kicking in the door. Actually, it turns out that if you kick a wall, you hurt yourself and you might die, so that actually gives you negative reward or at least terminates the episode. But what actually turns out, and this is really interesting, is that if you train in these procedurally generated environments, what happens is that occasionally there's an instance generated of this whole problem that is really simple. The staircase down might be just in the room next to you and the corridor might already be visible.
So from your starting position, you might already see where the staircase down is. So your agent, even when just randomly exploring, might just bump into that staircase down and go downstairs and get a reward. So this is fascinating, because it means with these procedurally generated environments, if you train for quite a number of episodes, there will be episodes generated that are quite simple, and where the agent actually can learn to acquire certain skills to then make progress on the harder ones as well. So this is one thing that we saw. So our agents right now, just by optimizing for score, they average at a score of I think 750ish roughly, which is not bad if you are new to NetHack. So if I take a random computer scientist in the lab and I ask them to learn about NetHack and play NetHack, I think it takes them a good fair amount of time to reach 750 on average as a score. I think the maximum score we've seen so far is maybe something like 4,000 or 5,000. They descend down to dungeon level five or six on average. But we also see individual agents sometimes, luckily, going down even to dungeon level 15. And we see agents killing a lot of monsters on the way, because that gives them immediate reward. We see them passing by landmarks like the Oracle or Minetown even. So that was actually quite surprising to us, that the vanilla approach can already make quite steady progress on NetHack. So that's quite encouraging, I think, for then building up all the extensions and more sophisticated models. Well, that sounds like a basic model then, you don't have to tweak it at all to get it to that level. Yeah, it's a very straightforward model. I mean, the only thing that we do is that we have basically a convolutional network that encodes the entire dungeon level that's visible so far. We have another convolutional network that's centered around a seven-by-seven crop of the agent; so that gives it basically some inductive bias that the things that are close to the agent are more important than, let's say, things that are very far apart. We have another feature representation based on the agent's statistics. And then all of that is mapped down to a lower dimensional representation that's fed into a recurrent policy parametrized by an LSTM, and then you get the action distribution out of that. So it's really nothing fancy at this point. Maybe we should have mentioned that it also does some bad things. If you optimize for score, for instance, it quickly notices that it has this pet with it in the beginning. And if the pet kills an enemy or a monster, then you don't get the score. So what it learns, at some point of training, is it starts killing its own pet, which is really bad. Oh no, that's so bad. But it will do that. And the interesting thing is, it starts playing random games, then it starts killing the pet during training. But then if you train for longer it stops killing the pet, because it notices that killing the pet actually makes the in-game NetHack deity mad at you, and bad things happen. So it will stop doing this after a while. It's kind of an interesting behavior. That's really interesting. Also I think we should mention that, from what Tim says right now, if you know the game of NetHack you'll notice that we don't actually use all the inputs yet. So NetHack has this status bar and the stats of your strength and so on, and it has the dungeon.
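A loose sketch of the kind of baseline just described, with a CNN over the visible dungeon, a CNN over a crop around the agent, an MLP over the agent's stats, and an LSTM policy on top; the layer sizes are invented and this is not the released baseline code:

```python
import torch
import torch.nn as nn

class VanillaNetHackPolicy(nn.Module):
    # Sketch of the described baseline: dungeon CNN + egocentric-crop CNN + stats MLP
    # feeding an LSTM policy. All sizes are illustrative, not the published baseline's.
    def __init__(self, num_actions: int, glyph_dim: int = 32, stats_dim: int = 25):
        super().__init__()
        self.glyph_embed = nn.Embedding(6000, glyph_dim)   # roughly 6,000 glyph ids, assumed
        self.dungeon_cnn = nn.Sequential(
            nn.Conv2d(glyph_dim, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.crop_cnn = nn.Sequential(                      # 7x7 crop centered on the agent
            nn.Conv2d(glyph_dim, 32, 3, padding=1), nn.ReLU(), nn.Flatten(),
        )
        self.stats_mlp = nn.Sequential(nn.Linear(stats_dim, 64), nn.ReLU())
        fused_dim = 32 * 4 * 4 + 32 * 7 * 7 + 64
        self.core = nn.LSTM(fused_dim, 256, batch_first=True)
        self.policy_head = nn.Linear(256, num_actions)

    def forward(self, dungeon, crop, stats, hidden=None):
        # dungeon: [B, H, W] glyph ids, crop: [B, 7, 7] glyph ids, stats: [B, stats_dim] floats
        d = self.dungeon_cnn(self.glyph_embed(dungeon).permute(0, 3, 1, 2))
        c = self.crop_cnn(self.glyph_embed(crop).permute(0, 3, 1, 2))
        s = self.stats_mlp(stats)
        fused = torch.cat([d, c, s], dim=-1).unsqueeze(1)    # add a time dimension for the LSTM
        out, hidden = self.core(fused, hidden)
        return self.policy_head(out.squeeze(1)), hidden      # action logits plus recurrent state
```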
But it also has this message, and it has these little in-game windows that can pop up, like your inventory can pop up and other things can pop up. And that's actually a research challenge, how to make use of all of this. And the other question is, what's the action space? A human can also just press capital S in NetHack and save the game and exit, and we don't actually want our agents to be able to do that. So you cannot give it access to the full keyboard, as it were, and typically what we do is we restrict the action set. Will the agent know its own inventory? Could it pick up some food and eat it later? Kind of yes, but we don't have a full solution for that yet, because we would need to feed that in as a constant observation and we don't do that presently. It is hard to exclude the agent from doing this, because different keys on the keyboard mean different things in different situations of the game. And in some situations, if you enable the eat action then you can eat some stuff, but maybe only those keys that are already enabled for the game. But it gets a little bit technical right now. Our agents can eat certain things in the inventory if it has the right letter, but not other things, for instance. Also, I think maybe it's worth emphasizing that right now we've been spending most of our time just building that NetHack Learning Environment, where you actually do have, if you want, access to the inventory observation. You can, if you want, use the entire keyboard as your action space. So that's out there for everybody to use, if they want to, and we hope that lots of researchers pick this up and come up with all kinds of interesting solutions that make progress on NetHack. And then on top of that we have this really basic agent implementation that we mentioned here; we'll release that as well so that people can piggyback on it. But obviously, there are a lot of open research questions of how to make best use of all these observations that come from different modalities, as well as really deal with this really large action space. One thing that I find super exciting is the fact that we as humans, we have all kinds of prior knowledge. When you play NetHack, although you've never heard about the game, and you bump into a door and you have, let's say, 170 actions that you could apply, like trying to drink the door or trying to sit on the door, you just don't do that. You won't even try this out; you know, I can try to open this maybe if I have a key, or if I don't, well, there's also this kick action. So maybe let me try to kick in the door. So this fact that we as humans are so amazing at using our prior knowledge, our world knowledge, our common sense knowledge to then really efficiently explore in these environments is absolutely fascinating to me. And that's why I also really like NetHack as a test bed for artificial intelligence, because I think ultimately we should have agents that are capable of transferring such domain knowledge from other sources to then be really efficient in these hard simulated environments. There's a concept in the NetHack community called source diving, where you look at the source code of NetHack and try to figure out how the game dynamics work. And ideally our agents should be able to do that. Our agents should look at the source code and be able to figure out how this game will generally behave given certain actions and then just do the right thing. That would be the perfect research agenda for NetHack.
I feel like on top of that, there's this really amazing community-created natural language resource, which is the NetHack Wiki. So almost everybody I know of who learned to play NetHack learned it by also looking up things on the NetHack Wiki. As you mentioned, you started playing NetHack when you didn't have any internet connection. So you couldn't look at any of these kinds of spoilers. That makes it almost impossible, I think, to make progress on NetHack. And even with the NetHack Wiki, it's really hard. So people sometimes play NetHack for 20 years before they first win the game. But this kind of resource is amazing. It's 3,000 wiki pages explaining how certain entities, items, monsters work. And I think one direction that's really exciting to me, and that's not really very different from what Heiner just described of directly looking at the source code. But what if we had agents capable of encoding information in the NetHack Wiki and using that to, for instance, explore more efficiently or avoid certain really stupid deaths? And yeah, just generally using that prior domain knowledge to be much more sample efficient and generalize better. It's funny, actually I think it's kind of a different game. In prep for this interview, I started playing NetHack a little bit again, and I kind of couldn't believe that I tolerated this game without the internet. It's just such a frustrating game with such little guidance. And I was reading your paper on reinforcement learning where you're talking about building a system to optimize for learning... I forget how you put it, but sort of optimize for modifying the state space of the algorithm. And then I was thinking of my daughter, who's clearly doing that. So she's nine months old and I've just been watching her a lot, and she clearly explores her environment in a way that she's just totally focused on whatever is novel. And there's no question that she's completely wired that way: if I show her a new toy, she loses it, or anything that seems to defy her belief about the laws of physics blows her mind. So clearly she's doing that. And then I was wondering if maybe myself as a child, I was kind of more willing, or kind of more enjoyed, the exploratory months necessary for figuring out NetHack. Yeah. That's a perfect remark. In fact, some of the research that we're doing is really centered around how can we design agents that are intrinsically motivated to learn in an environment? Because again, in NetHack, any reward function that we come up with, it's not going to be great. The actual thing we want to optimize for is solving the game, and there's just not any reward function, I think, that really can guide an agent step by step towards that. And I have two daughters, and in fact, my youngest daughter as well, at some point was playing with a toy kitchen and she was just opening and closing the door, until at some point she had even squeezed her finger in the door. She was crying. It was clearly something really bad. She was actually in pain. She was crying for a minute and then she continued closing and opening the door until it became boring. So this fact that we, as humans, are just setting ourselves goals when we are in an environment. We get bored and then we think of, ""Oh, what happens if you try this or that?"" And then we see, can we actually control this? Are we empowered to have control over what we want to do? Are we able to actually predict what's going to happen next?
And if not, then maybe that's really interesting, or maybe it's noise, maybe it's just the environment being completely stochastic and there's just nothing I can control. So how do we design agents that can do this as well? I think that's a question that's super exciting to me, specifically in the context of NetHack, because it has this stochasticity. It has this humongous, I guess, internal mechanism that governs the state transitions. So I think this will lead to lots of quite interesting research. In a sense, NetHack is really a hard case there. There's almost no human who plays NetHack unspoiled. I mean, typically people don't have a good reason to do that, because they need to find out about NetHack first. But for the few people who really were in the situation to try NetHack without any spoilers, it takes decades. You die so many deaths and you don't even know what to do. You don't even know what the exact goal of the game is. The game kind of tells you, like, if you read enough Oracles, but also there's a thing called rumors in the game where you can read up on what you're supposed to do, but there are also wrong ones. And if you're unlucky, you get the wrong ones that mislead you. So there's almost no way to find out how to even beat the game, let alone get around all the obstacles, if you don't spoil yourself. And we would like our computers to do that. But I want to mention another thing that Tim was saying, that there's no reward that leads you to beating the game. That is true. But what there is, is recorded games in the NetHack community. We could just look at what humans do and try to imitate this. Have all of us play NetHack, which we do in our lab a lot. And then try to train an agent that predicts human actions and then go from there. That might be one option. I was going to ask you about that actually, because I remember the first version of the successful Go algorithm was trained on expert games. Have you tried to train an algorithm? I mean, I guess even an amateur NetHack player would probably... you could imagine that helps the algorithm learn some strategy, right? So we're definitely thinking about doing that. The problem is getting the data, just getting a few games isn't enough for the methods that we have. We need enormous amounts of data and there's no easy way to produce it, unless we pay someone to play NetHack all day, and even then you have to play for a long time. Now interestingly, the NetHack community actually does have recorded games out there, but unfortunately they basically only record the outcome of the game, like a video stream of what the game shows; they don't record the actions that were put in by the players. And that's a research question by itself, how to make use of this kind of data. But yeah, it's certainly something that we are thinking about. Has anyone built a kind of a rule-based system that can beat NetHack? That seems like something someone would try at some point. People try, but I don't think they were super successful. I think there was one system that won in maybe 10% of cases, or maybe Tim can add the details on that end. Yeah. So if I vaguely remember, there are cases of hard-coded bots that ascended prior versions of NetHack, where as far as I remember they used certain exploits in the game. There's something called pudding farming where you can, I think, get a lot of items or whatnot, and then it makes the game much easier.
But these exploits, they are not in there anymore in the most current versions of NetHack. So all of these bots that were handcrafted some time ago, they won't work right now. Also, I think, ideally you want to have systems that are able to ascend, meaning win the game, with all kinds of character combinations. I mean, you have different roles in NetHack; races and gender and whatnot. So these bots, as far as I remember, were always quite specialized for one specific role in NetHack. But ideally we want to have agents, similar to humans, that can in fact win the game with all kinds of starting conditions. So could you maybe describe the paper that I sort of alluded to in a little more detail? I think it's the ICML paper on exploration and reinforcement learning strategies, and then maybe sort of say what the results were. I guess you were referring to the ICLR paper. Oh, ICLR, sorry. Yeah, no worries. So first of all, that was a paper that was not done on NetHack. So that was at a time when the NetHack Learning Environment didn't exist yet. This is a paper done by Roberta Raileanu. She's a Ph.D. student at New York University and she was interning with us at Facebook research in London. And she has done a really good job at investigating the current limits of these intrinsic motivation mechanisms for reinforcement learning. So maybe just to give a bit more context, one really open challenge in reinforcement learning is how do you learn in environments where the reward that you get from the environment is extremely sparse. So reinforcement learning works amazingly if you get a very dense reward function. So that means in many steps in the episode, you actually get a reward from the environment. But if your reward only comes at the very end and your episode is quite long, then it's really hard to learn from that. So what people have been doing in the past is developing all kinds of mechanisms that provide the agent with reward that's not given by the environment, but that is basically given to the agent intrinsically. And one such thing could be how well the agent is predicting the next time step given the current action, so you could use that. If your agent makes a big prediction error, in terms of, given the current state and the next action, what the next state is going to be, then we reward the agent. The problem with that is that there's this noisy TV problem, where in your environment there's some source of stochasticity, let's say a television that just shows white noise. So every prediction that you make as an agent will be wrong, because you can't predict what's going to be on the next screen. So you just reward the agent continuously for that. And that means that kind of noisy TV becomes an attractor to the agent. So the agent will just stand in front of the noisy TV all day without actually exploring the environment. So what Roberta was doing is she was building on top of work that is calculating intrinsic reward based on a forward model, trying to predict the next state, but also, given the representation of the next state and the representation of the current state, trying to predict the action that led to that next state. So that's an inverse model. And what she basically figured out is how can we make sure that the agent's internal representation of the state is only encoding what the agent can actually control in the environment.
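A loose sketch of that forward-model-plus-inverse-model idea, in the spirit of curiosity-style intrinsic reward; the sizes are invented and this simplification is not the paper's exact method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControllableRepresentation(nn.Module):
    # Sketch: an encoder trained jointly with an inverse model (predict the action from two
    # consecutive states) and a forward model (predict the next embedding). The inverse loss
    # pushes the embedding to keep only what the agent can control; the forward-model error
    # is then used as an intrinsic exploration bonus. Sizes are invented.
    def __init__(self, obs_dim: int, action_dim: int, embed_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        self.inverse_model = nn.Linear(2 * embed_dim, action_dim)           # (phi_t, phi_t1) -> action logits
        self.forward_model = nn.Linear(embed_dim + action_dim, embed_dim)   # (phi_t, a_t)   -> predicted phi_t1

    def losses_and_bonus(self, obs, next_obs, action_onehot):
        phi_t, phi_t1 = self.encoder(obs), self.encoder(next_obs)
        action_logits = self.inverse_model(torch.cat([phi_t, phi_t1], dim=-1))
        inverse_loss = F.cross_entropy(action_logits, action_onehot.argmax(dim=-1))
        predicted_phi_t1 = self.forward_model(torch.cat([phi_t, action_onehot], dim=-1))
        forward_error = F.mse_loss(predicted_phi_t1, phi_t1.detach(), reduction='none').mean(dim=-1)
        intrinsic_reward = forward_error.detach()    # per-transition exploration bonus
        return inverse_loss, forward_error.mean(), intrinsic_reward
```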
So if there's a noisy TV and the agent over time learns that its actions don't have any effect on the noisy TV, then it would just ignore that source of stochasticity in terms of, or with regards to, providing intrinsic motivation to the agent. And that led to, at the time, state-of-the-art results on quite hard exploration problems in MiniGrid, again, being a grid world a bit like NetHack, but just many orders of magnitude simpler, yet still really hard for contemporary reinforcement learning approaches. So that was that paper. I'm not super familiar with the literature. So let me see if I understood it. Maybe I'll channel the audience here. So it sounds like there's sort of a standard pattern of trying to actually go to places in the environment where you can't predict what will happen next. And I thought you were going to say it's wanting to optimize for being able to predict the next step, but that's showing my supervised learning bias, where you would probably want to optimize for good predictions, but you're actually kind of trying to go to places where you can't predict the next step, which makes sense, because more learning would happen. Yeah. So I mean, I hope I get this right, because it has been some time, but basically you should be rewarding yourself if you find novel mechanisms in the environment that you can control. But you shouldn't be rewarding yourself for novel observations in the environment that you can't control. Because if there's a noisy TV, you shouldn't be caring about that, otherwise you'll be standing in front of that TV for eternity. But yeah, you're right, there's always this kind of tension between training the agent to get better at doing whatever it's doing in the environment, which will also lead to better forward predictions, but at the same time also rewarding the agent whenever it encounters a mechanism that it can control, which also leads to novel observations. Now the problem is that another common approach is to actually count how often the agent observes a specific state. And that has been doing really well, for instance, in these Atari games where every time you play the game it's the same, but in procedurally generated games like NetHack, that won't work. It's just so incredibly unlikely that you will ever see the same state twice that counting them doesn't make any sense. So basically if you have a noisy TV and you can change the channel, we don't really know what to do with it yet. And honestly, that's how humans behave as well. So I think we're pretty close to AGI there. Yeah. No, I mean, it's funny. I mean, you alluded to this a little bit in the paper, but I was thinking some of the most joy I've felt is in NetHack. I really remember when you realize that you have a throw option and mostly use that to throw weapons, and it kind of guides you in that way, but you can actually throw food at animals and turn them into pets. And it's this incredible joy of realizing this surprising thing that you can do. Clearly there's a reward function, at least in my brain, of kind of discovering something new. And in your paper, you kind of alluded to some of this coming from education literature, or early psychology literature. Did you look at any of that when you were thinking about this? I mean, we have a paper together with Josh Tenenbaum, who was, I think, really leading in that area at MIT. I have to say, I'm not very familiar with that literature. I mean, that's the honest answer to that.
But I think that the thing that you just mentioned, in terms of, you know that you can throw not just weapons, but, as a human, you can also throw food around, you can basically throw anything around. And then realizing that actually in NetHack, the developers of NetHack, they thought of everything, that you can actually throw food around. That was a revelation to me. I mean, I have to say, I'm not an expert NetHack player, and in our entire team, Heiner is the only one who actually ascended in NetHack. So I had this revelation the other day where I was playing NetHack. And then I was always encountering graves. And I was like, ""Okay, you go over this grave and you get some interesting message that's engraved on the stone. Okay, fine. But what do you actually do with graves?"" I mean, there didn't seem to be any use to it. And then the other day, I thought at some point... actually, there are pick-axes in NetHack, what if I dig up whatever's lying in that grave? And there's actually something in that grave. I mean, there's definitely a corpse, but there might also be items in there. Again, like for you, it was so interesting for me to see that my kind of prior knowledge about the world also applied within NetHack, although it's this kind of terminal-based game. So that's, again, why I believe NetHack is such an amazing resource for artificial intelligence research. Okay. So we've probably driven away anyone with any kind of practical mindset, but this is supposed to be for people practicing machine learning for real world applications. I mean, where do you think reinforcement learning goes? I feel like the knock on it right now is maybe that it's really just for these kinds of toy environments, like Atari games and Dota and NetHack. Is it being used for things that we would experience now, or is it on a path to being useful for things? Where do you think the applications are? Yeah. So first of all, I think it's not necessarily fair to say that the kind of research that's done in simulated environments is not with real world applications in mind. So it's very funny in that NetHack is this really old game. So it feels like a step back from more 3D, visually appealing games like Dota. But in fact, as we, I guess, discussed now, NetHack has a lot of properties of how you also would try to solve tasks in the real world. If I try to fix my car engine and I have no idea how to do this, maybe I can look up information on Wikipedia. I mean, probably it's not going to work, but we are so good at using world knowledge, common sense knowledge, and also acquiring specific domain knowledge for solving tasks in the real world. So in some sense, I feel like NetHack is even a step forward towards actually making progress on real-world tasks with reinforcement learning. Also, given the fact that it's procedurally generated and every time the observation will look different, similarly to the real world. Again, a count-based approach won't really help you that much, because the world will look different tomorrow. And at the same time, I think there are also more and more applications of reinforcement learning to the real world. So for instance, we published a workshop paper on using reinforcement learning for learning to control internet traffic. So there are these handcrafted heuristics that people have been developing for decades, TCP protocols and whatnot, that govern how I'm going to...
Sorry, for congestion control, how far the congestion window should go and how I can maximize my throughput in an internet network; how can I make sure that I can send as many packets as possible without losing too many packets because of congestion from the other participants in the internet network. And we are developing approaches that allow us to train reinforcement learning agents to automatically learn what's a good policy in terms of sending out how many packets per second so that they maximize that number. So there are definitely more and more applications of reinforcement learning in the real world. Also, advertising is, I think, an example. And so I think we'll see much more of that in the future. Yeah. I think computer systems, operating systems and so on, they have all kinds of inbuilt heuristics that are often good, but perhaps not optimal. And reinforcement learning is one way to try to optimize these things. If you look at the Linux kernel, by the way, looking at the NetHack source code is a great gateway drug to becoming a kernel developer, it's basically a mini Unix in there. But if you look at the Linux kernel there's all kinds of heuristics and constants and wait times and so on. And potentially you could actually not just hard-code these things, but learn them on the fly. Of course, you have a complex system if you do that, and you may not want to do this at all times, but it's certainly an option and I think this is where the world is going. I want to make one more comment about NetHack. We compared NetHack to Go early on, but I think the comparison I like more is StarCraft. So StarCraft 2 has famously been a challenge, and of course it's a multiplayer game, so it's different in that sense. But many of the challenges that StarCraft has are also in NetHack; a big observation space, complex environment dynamics, big action space, and all these things that are technically hard. But on top of that, to actually solve StarCraft you basically use up the energy of a small town. And to play NetHack, it's really cheap and you can do this in your university lab, on your home computer and so on. So that's one of the sales pitches for NetHack. As a reinforcement learning environment, NetHack is simple where it counts but hard where you want it to be. So it's fast, but hard. And often it's the other way around: there are reinforcement learning environments that are complex but easy. So everything is 3D, finely rendered, but the actual policy you need to execute is left, left, right and you're done. I guess what makes for a hard reinforcement learning challenge, it seems to me that having to sort of save some state to use a lot later seems to be challenging. It sounds like you do have a good intuition for what games would be easy for reinforcement learning and what games would be hard. So the thing that you just mentioned, that's one: long range dependencies. How do you memorize, or how do you remember, that maybe on the first level of NetHack you dropped a certain item that you need much later, or whatnot. And actually, NetHack has these very long range dependencies. Normal play of NetHack, if you succeed, is maybe on average 50,000 to 100,000 steps. There are expert players who can solve NetHack in 20,000 steps, but that's still an order of magnitude longer than, for instance, a normal game of StarCraft 2, which goes on for 15 minutes but has only a few actions per second, so I think the average is around 2,000 steps. So long range dependencies is one. Then the question of exploration.
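To make the congestion control example concrete, here is a toy, purely hypothetical sketch of how that problem could be framed for a reinforcement learning agent; the observation, action, and reward definitions below are illustrative placeholders, not the setup from the workshop paper mentioned above.

```python
import random

class CongestionControlEnv:
    """Toy environment: the agent adjusts its sending rate each step."""

    def __init__(self, capacity_pkts_per_step=100):
        self.capacity = capacity_pkts_per_step
        self.rate = 10  # packets sent per step

    def step(self, action):
        # action: -1 = decrease sending rate, 0 = keep it, +1 = increase it
        self.rate = max(1, self.rate + action * 5)
        cross_traffic = random.randint(0, 80)                  # other senders on the link
        delivered = min(self.rate, max(0, self.capacity - cross_traffic))
        lost = self.rate - delivered
        reward = delivered - 2 * lost                          # reward throughput, penalize loss
        observation = (self.rate, delivered, lost)
        return observation, reward

env = CongestionControlEnv()
obs, r = env.step(+1)  # e.g. try increasing the sending rate and observe the outcome
```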
So how easy is it for the reinforcement learning agent to discover what it can do in the environment? How it can control things in the environment? How often does it bump into reward? Another question, I guess, is: do you have all the information that you need given in the environment itself in order to do well in the environment, or do you have to have a really strong prior based on, as I mentioned, your common sense knowledge, world knowledge or domain-specific knowledge? If you have a very large action space, that's really problematic for current approaches, but we as humans do well because we prune away lots of that action space. Can you easily plan ahead? Is your environment fully observable or is it only partially observable and you have to actually infer what's going on in the hidden parts of the environment? So these things make games or environments hard or easy for reinforcement learning. It's funny. As you were talking, I mean, did you guys notice how I tried to kind of steer this towards general topics but I wasn't able to? But since we're back in this NetHack topic, have you thought about my other favorite game, Kerbal Space Program? Are you fans at all? Have you played this game? I mean, I've seen that on stream, I haven't played it myself; but I think that's a really interesting example. Again, as I mentioned, I haven't played this, I only watched the trailer. The fact that we as humans can build mental models of what should work and what shouldn't work and then test them, I mean, that's I guess what you do in that game. You have an idea of what might work out, in terms of a rocket that can fly, you build it, then you see it fail and then you make modifications. Again, you plan in your head what kind of modifications you want to make, you make them and then you see again. This kind of way of experimenting in an environment, I think that probably sounds quite interesting as a reinforcement learning challenge. That said, I haven't played it myself. And I'm pretty sure current approaches would struggle a lot. Can I ask you, is there anything just practically that changes when you're trying to train reinforcement learning algorithms, if you're kind of used to more supervised learning algorithms? What's different about that kind of training? I think there are some engineering challenges to reinforcement learning. Basically, reinforcement learning, you can make it look like supervised learning, but the data comes from... you generate the data yourself. As opposed to just reading photos from disk, you generate the data yourself. And this is actually what modern reinforcement learning systems like, say, IMPALA or various others do: they have this part of the system that produces the data and then a part of the system that learns on the data, and there's all kinds of engineering challenges around this space: asynchronous processes, data communication and so on. But apart from that we use PyTorch, we use standard tools. You have to have the compute. Typically, the games run on the CPU, so you have to have more CPUs, and the machine learning code runs on accelerators like GPUs. But once you have that in place, it looks pretty familiar. The models look familiar. They input a picture, or like a game observation, and output probabilities of certain actions.
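A schematic sketch of that producer/consumer split, in Python; the ToyEnv and ToyPolicy classes below are placeholders, not the actual IMPALA or NetHack code, and a real system would use multiple processes and a GPU learner rather than threads.

```python
import queue
import random
import threading

class ToyEnv:
    """Stand-in environment; a real setup would run many game copies on CPU."""
    def reset(self):
        return 0
    def step(self, action):
        # returns (next_observation, reward, done)
        return random.randint(0, 9), random.random(), random.random() < 0.1

class ToyPolicy:
    """Stand-in policy shared by all actors and updated by the learner."""
    def act(self, obs):
        return random.choice([0, 1])
    def update(self, batch):
        pass  # a real learner would do a gradient step on an accelerator here

experience_queue = queue.Queue(maxsize=256)

def actor(env, policy, steps=200):
    # Actors generate experience and push it to the shared queue.
    obs = env.reset()
    for _ in range(steps):
        action = policy.act(obs)
        next_obs, reward, done = env.step(action)
        experience_queue.put((obs, action, reward, done))
        obs = env.reset() if done else next_obs

def learner(policy, batches=20, batch_size=32):
    # The learner consumes batches of experience and updates the shared policy.
    for _ in range(batches):
        batch = [experience_queue.get() for _ in range(batch_size)]
        policy.update(batch)

policy = ToyPolicy()
actors = [threading.Thread(target=actor, args=(ToyEnv(), policy)) for _ in range(4)]
for t in actors:
    t.start()
learner(policy)   # consumes 20 * 32 of the 4 * 200 transitions produced
for t in actors:
    t.join()
```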
So there's one additional thing I would want to mention, and that also relates, I think, to Weights and Biases, and it's that in reinforcement learning, generally your results have much higher variance, so you can train an agent once and then you train it another time and the results might actually look quite different. So you have to be careful in terms of how reliable your results are when you only train based on one run, basically. That makes it interesting in terms of how you should plot these results in publications. I mean, ideally, you should be repeating your experiments multiple times, and you want to plot maybe the mean of the different ones, and you also want to indicate the variance to some extent. But I think in publications, we've seen all kinds of tricks of how people make results look better than they actually are. I mean, how do you even think about reproducibility of reinforcement learning results, if they're inherently stochastic? I think it's fine as long as you make sure you train with different initializations of your model multiple times. And then that really comes down to a question of, how expensive is it, typically in academia, to reproduce these results. Again, sorry to mention NetHack, but that was exactly the kind of motivation behind that environment, that it's really complex but at the same time should be affordable for grad students, masters students, and whatnot, to actually do experiments with. I hadn't thought about that, that's such a great point. But still actually, you do need quite a lot of resources to even do NetHack. You were saying, you built some kind of system or you're using some kind of system to train in parallel, right? Yeah. But you can run this on a single box with say two GPUs, so you'll just wait a little bit longer. For NetHack, we don't currently use like hundreds of GPUs in parallel, we could do that but we just haven't invested the engineering hours to do that properly. But you can actually run this at home. I mean, you could even run this on your MacBook if you wanted to wait long enough and make life a little bit hard. Depends on what kind of neural networks you would apply to NetHack, but this is actually something you can do at home. And actually, I mean, you can even do this really well with just one GPU. I think our implementation of our agents is based on TorchBeast, which again is based on IMPALA, and we have two versions of that; one is training based on one GPU, and we have one that's training using two GPUs. I mean, just with one GPU, you can do experiments, you can write papers on NetHack with one GPU; I'm quite certain of that. Cool. And it's basically just playing the game over and over and over and then updating the model? Yeah. We have this line in our paper where it mentions how many agents died in the process, and it's a large number. Probably, by now it has played far more games than the rest of mankind combined. Have they really not found a flaw in NetHack that they can exploit? It's kind of amazing to me that there's not some tricky way that you can live forever or something. Well, our agents actually haven't explored that large a part of the game yet. We are really at the beginning of the research here. People have tried what are called tool-assisted speedruns with NetHack and have found exploits, some of those ones that Tim mentioned, pudding farming and so on. But the NetHack DevTeam kind of keeps track of that and removes these things one by one from the game.
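A minimal sketch of the multi-seed reporting practice described here: run the same experiment with several random seeds, then report the mean learning curve together with its spread instead of a single run. The numbers below are synthetic placeholders.

```python
# Synthetic illustration: aggregate returns across several training runs (seeds).
import numpy as np

rng = np.random.default_rng(0)
# returns[i, t] = episode return of run i at evaluation point t (placeholder data)
returns = rng.normal(loc=100, scale=25, size=(5, 50))  # 5 seeds, 50 evaluation points

mean_curve = returns.mean(axis=0)
std_curve = returns.std(axis=0)

# A plot would show mean_curve with a shaded band of mean_curve +/- std_curve,
# rather than the curve of a single, possibly lucky, run.
print(mean_curve[-1], std_curve[-1])
```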
So NetHack by now is pretty resilient against these kinds of exploits. Are you in communication with the NetHack DevTeam? We did reach out to them at some point and they were very kind. That's great. So NetHack has been under development for over 30 years; as I kind of mentioned, there's been a lot of effort in kind of removing all of these kinds of exploits. Right. Okay. Sorry. What about a really in-the-weeds question? Does the agent have a preference, in that you can kind of go down the normal levels where you meet the Oracle and stuff, or you can go down to Minetown? Does the agent kind of learn that one path is safer than the other? I always kind of wonder which I should go to first. It's a great question. That's exactly the kind of high-level planning that our agents right now are not capable of, so it's basically by chance. So sometimes they just follow the main dungeon, the Oracle, they even get to the big room or even further down and then at some point die, or they go into Minetown and at some point die. We haven't really seen agents being strategic about first making some progress in the main dungeon and then going back up to the fork to then go down to Minetown to get items and whatnot. But I mean, that's really one of the next, I think, milestones that we should get to. It's also because our agents have a really hard time remembering things long-term. To a first order of approximation, our agents optimize for the current situation without any regard for the past. So going down to Minetown; if you go down any stair and you happen to enter the Gnomish Mines, which is a special dungeon branch in NetHack, the logical thing for you to do is to kill the monsters in the vicinity, not to go back up to just where you were, where you already killed things. So if you optimize for really short-term things, that's how you end up playing and that's what our agents do. That said, we have seen our agents go back upstairs and we're not quite sure if this is just random chance or if this is something where they got incentivized to not play certain levels, but that's where we are. All right. Well, I'm really excited to play with you. I'm even more excited to play with your NetHack development environment, I really want to give it a run myself. I always end with these two questions, I kind of wonder how they'll work in this format. But do you have any kind of underrated aspects of reinforcement learning or machine learning that you think people should pay more attention to than they are right now? You mentioned really fast environments. I think on top of that, in my view, people should be looking more into causality. I mean, it's something that I'm not very familiar with, but I think in terms of making progress as a community we should be looking more into causal models, because essentially that's also what you are learning when you're playing NetHack over and over again. At some point you have some mental causal model in mind, ""If I do this, then that happens,"" or at least with some probability something happens. And I think that's the only reasonable way we can go forward in terms of agents that really can systematically generalize to novel situations; you have to have that kind of abstract mental model in mind that you use for planning and for exploration and so on. One thing that bugs me a bit about research in machine learning at large is that we make these artificial distinctions between this is engineering and this is research, where if you want to fly to the moon, is that research or is it engineering?
It's kind of both. And I think in particular it's especially true in reinforcement learning, where the breakthroughs that we saw recently came to a large extent from engineering breakthroughs. I totally agree with that. And that's actually a good segue into the last question that we always ask which is, we usually frame it; what's the biggest challenge of machine learning in the real world? But I think maybe for you two I'd be curious, what are the surprising engineering challenges of making reinforcement learning work that you wouldn't necessarily know as a grad student doing your first toy reinforcement learning project? I think, I mean, maybe we should make this clear. What we do when we're training reinforcement learning agents in modern approaches is we have dozens or hundreds of copies of the game running simultaneously, played by the same agent and then something needs to ingest all of this information. So I'm not sure if people are aware this is how it is. People used to think of it like, this is the world and this is my agent and my agent connects to the world and there's only one world, obviously. But things are just so much faster if you have a batch of worlds and you interact with a batch of experience. Although, that is kind of bad news for all the comparisons to how humans learn and how real biological systems work. I think on top of that, I would encourage people to really look at what these agents or what generally your machine learning model is actually doing on the data. So it's I think quite easy to try to chase some leader board numbers or try to chase better scores on NetHack without actually understanding what your agent is capable of or not capable of and how that informs your modeling choices, modeling decisions and generally, your research or engineering work going forward. And so one final question, mainly for Heinrich I think. So for someone like me who has been playing NetHack for almost three decades and never ascended, do you have any tips on how to improve my NetHack skills? I think there's one point in NetHack where you ask a special shopkeeper in Minetown and it tells you to slow down, think about it. You have as much time as you want to do any action like, so NetHack is turn-based. So I think this is the best approach, think clearly. But it's really not human. You see this bad dragon and you want to run away from it, but there's no need for speed in that sense in NetHack. Just thinking clearly about every step is this the best approach. Yet so hard to do. And read the spoilers. Awesome. Thank you so much guys, that was super fun. Thank You. Likewise. Thank you so much for the invitation.",10327 +Daphne Koller — Digital Biology and the Next Epoch of Science,https://www.youtube.com/watch?v=prGz_6Jb16M,2779,2021-02-18,"I'd come from a family where I was privileged in that both of my parents had access to higher education, and I saw how much opportunity that created for me, that others just didn't have. And I guess I've always felt, and still feel, and really try and teach my children as well, that for those of us who have been privileged, too much as expected and it's our responsibility to give something back. So that was, at that point, my way of giving something back, is by teaching. 
And in fact, that was what led me to also eventually depart Stanford, because I felt like my opportunity to give something back to the world on a much greater scale was available to me by founding Coursera and opening up education to a much, much, much larger number than I would ever be able to teach at Stanford. And that's actually also what led me to insitro, because I feel like there's an incredible moment in time now in bringing together two disciplines in a way that could be totally transformative to the world. And I think it's kind of incumbent upon me. There's almost a moral imperative to make that happen if I can do that. And it's not something that many other people can do. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host Lukas Biewald. I'm excited and maybe a little nervous to interview Daphne Koller, who is a very famous, successful machine learning professor and also the founder of insitro, and a founder of Coursera, and recently founder of Engageli. Three super different, super interesting startups. I should also say she was my first machine learning teacher, in CS221 at Stanford, and then I TA'd for her. And then I did research for her later on. So she might actually be the reason that I'm here today recording this podcast. So again, super excited to talk to her. The thing I most want to talk to you about actually is insitro, which looks super fascinating and exciting. And maybe for those who haven't heard of insitro, you could sort of give us a quick overview of the thesis of the company. Sure. insitro is a drug discovery and development company. And if you've been looking at drug discovery for the last 50 years, you will see that we've made a tremendous amount of progress in bringing medicines to patients in need. But at the same time, there's this thing called Eroom's law, which is the reverse of Moore's law, in which there is an exponential decrease in the productivity of pharmaceutical R&D. And when you ask yourself why that is, it's because the journey of discovering and developing a drug is really complex and long. And there are many places along that journey where we can take the wrong turn. And when we do, it can take months if not years, and millions if not tens of millions of dollars, to realize that we took the wrong trajectory. So what we're trying to do is to really build the company in a way that uses machine learning, which, after all, is something that is helping us make really good predictions in so many other domains, and use that as a way of building this drug discovery and development process on a completely different foundation. So that's what we're really trying to do, is bring better medicines to patients and do it faster. So what's the standard drug discovery process at a high level? And where does machine learning fit into this or improve it? So I don't know if one can really talk about a standard journey because it's been an evolving process over the last few years. If you want to draw a very coarse-grained caricature, you can say, ""Well, I have a disease and I do, usually..."" That's done in an academic center, a bunch of biology to uncover the genes and the biological mechanisms, pathways, that are implicated in disease. And then someone has a hypothesis about, okay, if I make an intervention at this gene, it may cure, or at least help address, cure is a very broad word, very ambitious word, we've cured precious few diseases, but to help address some of the aspects of the disease.
And once you have that target, you can start to identify... Well, first of all, you have to validate the target. And oftentimes that's done using animal models that attempt to simulate some aspects of the disease. And for many of the diseases that we have today, the animals don't get the disease naturally. And so you kind of have to create the disease in the animal and then try and address it in the animal. And it oftentimes turns out that what you're addressing really isn't the true disease at some simulation of it that is very imprecise and sometimes just downright wrong. And then, once you have a target, then you typically look for chemical matter, a compound that helps modulate that target. And there's different, what are called therapeutic modalities, which are different kinds of interventions. It used to be, whatever, 30, 40 years ago, that the main form of a therapeutic modality we had was small molecules. And then around came biologics, which are larger molecules. Basically proteins and antibodies, which are a type of protein that are, in many cases, more precision mechanisms. So they're much more precise in their action, but they're also harder to administer and they are able to address a narrower set of targets. And now over time, we have additional therapeutic modalities that have emerged over the last two years that help intervene in the body and other types of mechanisms. So everyone's talking about gene therapy as they should, in which case we can come in and intervene in the DNA itself. There's only a very few of those that have been approved so far, but it's very much a growing field. Now with the COVID-19 vaccine, everyone is talking about RNA therapeutics, which is intervening in between DNA and protein at the RNA level. So all of these are ways that are expanding our capabilities to make intelligent interventions in the human body and hence in a disease process. Oftentimes, where it fails is really at the very beginning, which is, we do not understand biology well at all. And therefore our ability to recognize when intervening in a target is going to actually have meaningful clinical benefit to a human is very, very limited. And oftentimes, we guess, and we guessed wrong. And sometimes we also fail to understand all of the other implications that an intervention in a given target might have. For example, all of the other things that this particular gene does in the body. And if we intervene in a way that maybe even beneficial for this, it might be detrimental for that. And so that's where a lot of our ability to make valid predictions really falls short. And that's where a lot of drugs fail. And right now, the failure rate, depending on what you consider to be the denominator, like when do you start counting a program as a drug program, is between 90 and 95%. That's the failure rate, not the success rate. Which means between one and 10 and one in 20 drugs actually go on to be approved. And even smaller number actually ended up making a real difference to patients. And that's what we're looking to fix, is how can we make better predictions, first and foremost, about what kinds of targets you would want to intervene in for a given disease in the context of a given patient population. And then subsequently, fine, we want to intervene at this target, what is the right chemical matter to put in that might have fewer side effects, that might have better drug-like properties? What is the right patient population to use? 
A lot of the failures that I think we have today are because we try and go after a much broader or miscalibrated patient population. And so over time, I think there's many questions in this process where machine learning can make an intervention: the target, the drug, the patient population, the biomarker that tells us when a drug is working so that we can cut things short. If it's not, then transition the patient to another drug. All of these are areas where I think machine learning can play a role. And does the machine learning try to kind of model the physical reality of the world here? Or does it ignore that and just sort of look at past experiments that were tried? I think people have tried both. And as we've seen in other cases where machine learning has been applied, there are some benefits to incorporating a lot of prior knowledge about the world, but then, over time, that begins to become a limitation. So I used to work in computer vision way back when people still tried to create models of how light is refracted off of surfaces, and having geometric models for computer vision, and models of illumination, and so on and so forth. And we don't do that anymore. What we now do is create really, really large training sets and give the computer enough data that it can learn the patterns without having to be told a lot about the structure of the world. We haven't quite hit that tipping point in most biological problems because the data that's been available has just been insufficient. And so right now, there's a lot of problems where models that incorporate more of our understanding of biology are actually, in many cases, outperforming models that are less informed. But, to my mind, a real highlight achievement from the past year that starts to go in the other direction is the incredible success of DeepMind's AlphaFold algorithm, which uses somewhat similar machine learning tools to AlphaGo, which they used in a very different domain. And AlphaFold is basically addressing the problem of protein folding. So to take an amino acid sequence that represents a protein and ask what it will look like in 3D space. There's been multiple groups over the past, I don't know, 10, if not more, years that have built computer tools. Some incorporating machine learning, but certainly all incorporating a relatively large amount of prior knowledge about physics, and chemistry, and forces, and electrons, and so on and so forth, and asking what the folded protein would look like. And all of them asymptoted at a certain level of performance, which was reasonable, but not usable. And by the way, I forgot to say that there has been a biennial competition, once every two years, called CASP, which is one of the best-designed real blind tests for a machine learning model, one where you can't cheat. In which labs that are experimenting on a particular protein by generating its crystal structure, which is the 3D structure, would submit the sequence to the CASP competition and they would not release the solved structure until the competition was done. And since no one can... It's months of experimental work to come up with that structure. People couldn't cheat on the test data. So in this CASP competition, you could see that there was a plateau of performance. And then this last year, DeepMind really broke through that plateau and achieved a performance that is actually usable for real biological problems.
And the way they did that is by not incorporating into the model a lot of preconceptions about physics and chemistry and different kinds of chemical bonds, but really just giving the machine learning model enough pairs of sequences and soft structures to train on. And then they said, ""Okay, now that you've learned, go and run on a new protein."" And they were able to break through that ceiling that we've seen. So I think, to my mind, that's an indication that we need to be really thinking hard about how to generate enough data at scale for biological or chemical problems, so that you could get machine learning to break through that ceiling and performance. And so that's kind of what we're trying to do at insitro is build massive data production capabilities across the problems that we care about, so that we can generate data that's enough high quality and large enough, and that is fit to purpose, so that you can train machine learning models to solve the problems that we care to solve in the drug discovery process. So, I guess, I want to get back to insitro in a second, but since the protein folding thing was so high-profile, I'll ask you my dumb questions, which is such a waste. But I was kind of curious, What was the insight then? It seems like just actually removing prior beliefs from a model wouldn't be enough to have a breakthrough improvement in the quality. And surely, lots of people had access to lots of examples of proteins and how they fold, right? So I can't speak to that yet because they have not yet published their latest model. And so we're relying on the very limited information that's in the press release. And so I would be curious to read the paper once it's out. But I do know that they incorporated a lot of insight from the latest machine learning models, in terms of, for instance, attention models that you can look to see where you would want to have one amino acid look elsewhere in the sequence to figure out where to fold. But I wish I could give you more insight into exactly how this works, and I'm hoping that they will publish the results soon and we will all learn from how they did this. And is protein folding, is that a sub-problem of one of the problems that you mentioned? Or is that just an example of how much momentum there is in the field? I think people have differing opinions on the extent to which protein folding matters in drug discovery. I think there's a lot of proteins where the structure is actually pretty well understood and we just don't know how to drug them. Protein folding certainly doesn't help you with the fundamental question of picking the right target to go after, because the folding comes after you've decided that this is a target that you need. There certainly are a set of targets where you really would like go after them, and what's missing is an understanding of their 3d structure. How big that set is, I think, is a matter for debate. So to my mind, it's less about whether protein folding is the key problem in drug discovery. Certainly not the key problem. It may be a problem, but it's certainly not at the core of what is holding drug discovery back. But it's really an illustration of taking a problem that everyone agreed was hard. People had struggled to solve or tried to solve using a range of other methods. And machine learning came in and, with the right type of model and the right type of data, was really able to crack that nut open. And so that, to me, is the real lesson here, rather than we've transformed drug discovery. 
Interesting. I guess another question that comes to mind is, I remember back in 2004, you were working on applications of machine learning and biology. And some of them actually sound quite similar to what you're talking about at insitro. And so when I started the company, almost two decades later, is it that the biology has improved or the machine learning has improved or the data has improved? What's the key thing that's changing that makes insitro possible now? It's a combination of both, actually. The first is the availability of much, much larger amounts of data than we have had before. So in the last decade or so, there has been this tremendous amount of progress in biological tools that are good for data creation. And that includes everything from the incredible growth in the feasibility of DNA sequencing, and not just DNA, but also RNA sequencing and various other aspects of sequencing. Microscopy has grown a tremendous amount in both its throughput and its capabilities. In the chemistry side, we have these really cool things called DNA encoded libraries, which are basically chemical libraries that can have hundreds of millions of molecules all mixed together in a test tube. But because they each have a DNA barcode attached to them, you could basically figure out what they do without... Even though they're all kind of mixed together in a pool. There's microfluidic techniques that allow you to do experiments in teeny little droplets, which achieves both spatial separation, as well as scale. All of these techniques are things that didn't exist a decade ago. Oh, and not let me forget CRISPR, of course, which is the ability to now start to edit the genome in a very fine-grained way, and then ask what happens to a cell when its genome is edited in a particular way. That is something that when I was doing, even not in 2004, I went and did a sabbatical at UCSF in, I think, 2009. And we were doing these experiments in knocking pairs of genes in yeast. And yeas is a very malleable, editable organism. And the experiments were incredibly slow and painful. And they were in yeast, which has 6,000 genes. Now, if you want it to do pairwise knockouts in human cells, it's an experiment that you could do in a couple of weeks. And it's just amazing how things have changed that way. So I think that, to me, is actually the biggest transformation, but the other one, of course, is just transformation that we've seen in machine learning. It's hard to imagine thinking back, but in 2004, when we were doing computer vision, and you might remember this, Luke, that we were looking at questions in taking an image. What is in this image? Is there a dog in this image? It's like, ""I don't know. Maybe."" And it was barely above random. And now, in 2018, I think the lines crossed where the machine performance is actually above that of a human. And that is for tasks where humans are actually good, that is humans know how to recognize dogs in images. We're trained to it from birth, and yet the machine is outperforming a human. When you're looking at tasks where humans are actually not so good, like, for example, recognizing biological patterns in images, or even worse, in sequencing data, the machine is just so much better than a human. Yeah. This is a broad question I didn't expect to ask, but I'm curious your thoughts. What were the key insights, you think, between 2004 and 2018? Was there one thing that you think was really the change? I think it's a combination of three things that came together. 
One is, yeah, we had better machine learning models, which were often just a matter of having the willingness and courage to not just look at simple models, but be willing to bite the bullet about models that are not convex, that there isn't just a single optimum that really have a lot of dependence on exactly how you optimize them. So that's one thing. The second is the existence of large enough data sets that one could train such models despite the complexity of the space without over fitting radically. And I think that's a place where contributions such as image net and others, which really created large enough data sets so that one could actually start training those models were as important as the models themselves. And then the last one is compute at the push of a button. It used to be that, in those olden days, I'm feeling really old right now, that when we had to do anything that required large amounts of compute, we had these local compute clusters that were painstakingly maintained by local IP people. And you ran your job, and it took six months to run. And that you hope there was no memory leak. And then at the end of the process, you never ran it again because you would never risk doing it more than once. And now, we have the cloud and you can do this on 10,000 machines and your results come back in a day. And honestly, to me, that's been as, or more, transformative than anything else. Because our ability to do that, combined, by the way, with platforms such as PyTorch and TensorFlow or an Adam that allow us to program much more quickly, we're now able to experiment and improve our models in an iterative loop that we were never able to do before. So even if our initial models like, eh, the second time and third time and fifth time and 20th time that we iterate and make it better, it's going to get better and better over time. And so that combination of better software, better tooling, I'm not talking just the better machine learning, just the tooling around the machine learning and the better cloud computing, which enables this rapid iteration cycle, has frankly been, I think, as, or more, transformative than anything else. Which kind of leads to a question I had in biology in particular, which is, are there datasets available in biology in the same way as in vision? There's an impression that there's more proprietary data, I guess. So that's, again, something that's changing. And one of the datasets that has been most transformative, I think, at least from the work that I've done, is the UK biobank, which is 500,000 people with genetics, with clinical outcomes, including longitudinal clinical outcomes, and very deep phenotyping that includes different types of imaging and blood biomarkers and urine biomarkers and a whole bunch of other covariates like environmental factors. And that data set has, on its own, I think, been truly transformative, both in the development of new methodologies and in the insights that it's given us about human biology. There's been other data sets that have been, I think, also very important. They aren't as large or as carefully curated, which, I think, has limited, to some extent, the impact relative to the UK biobank, but still have been quite significant. So there is the TCGA, which stands for The Cancer Genome Atlas, which is a reasonably large cancer dataset across different tumor types. There is the... 
Let's see, the GTEx dataset, which speaks to different gene expression across different tissues and different individuals, so that you can look at the variation within an individual across their tissues in their gene expression, but also for the same tissue across individuals. So you can kind of have this be like a two-sided matrix. There's others that are like that, ENCODE, which speaks to DNA markings across different cell types. So I think there is more and more of that available that is not entirely proprietary. There are also some on the chemistry side. By and large, though, with a few exceptions, they aren't like the UK Biobank, which is, I think, the best example of something that is truly high quality, truly well curated, with every experiment done exactly just so. And that is a challenge for a lot of people because noise in biology is much more of an issue than it is in many other domains. That is actually why we're building insitro the way we are, which is we have a significant wet lab component whose primary purpose is to generate large amounts of data so that we can train the models in the right way. Is there a notion of transfer learning in this field in the same way as in vision or are the problems just too different? I think that certainly there is transfer learning. And even in images, there have been examples where people have trained ResNet models on images on the web, and then done transfer to microscopy images. Which is incredible, right? Isn't that amazing? I know, it's amazing, isn't it? So, I mean, I would expect it would be even better if you train on microscopy images. But still, the fact that this actually does translate is, I think, a pretty remarkable achievement. I think there's other examples that one could generate. People have done a fair bit of work, especially recently, on pre-training of, say, graph neural network models for chemical structures on large numbers of compounds. And then using that type of encoding as a pre-trained model for something for which you have less training data, like more specific properties of compounds. So I think that's actually one of the big areas that, I think, will become important over the next few years: how do we make use of some of those larger data sets that maybe have less supervision as a way of enabling us to build models that are useful on a smaller data set. But you actually built a wet lab to collect data, which is super interesting. How does your team break down into people doing machine learning and people doing, I guess, biology, and people doing other? So if you take out the small fraction of people who are, like, G&A, the composition of the company, for most of the time, used to be about 50/50. So initially I think we had a few more wet lab people because you need to start making data before you can really have a lot of data to analyze. But even then we had some computational people who helped with making sure the experiments were designed right. And then it became about 50/50. And then now we're actually starting to grow the next set of functions. Which is once you have insights that come out of the biology, you actually have to make drugs. And so we're starting to build out functions in chemistry and drug discovery. And so the balance is shifting a little bit more towards the life sciences. But it's really quite evenly distributed among those functions. Oh, that's cool.
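A hedged sketch of that transfer-learning pattern in PyTorch: take a network pre-trained on natural (web) images and fine-tune only a new classification head for microscopy classes. The class count and dataset loader are placeholders, and the actual studies referenced here may have used a different recipe.

```python
# Hypothetical fine-tuning sketch: ImageNet-pretrained backbone, new head for microscopy classes.
import torch
import torch.nn as nn
from torchvision import models

num_microscopy_classes = 4  # placeholder

model = models.resnet18(pretrained=True)                       # weights learned on web images
model.fc = nn.Linear(model.fc.in_features, num_microscopy_classes)

# Freeze the pre-trained backbone and train only the new classification head.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A training loop over a DataLoader of labeled microscopy images would go here, e.g.:
# for images, labels in microscopy_loader:   # microscopy_loader is hypothetical
#     optimizer.zero_grad()
#     loss = loss_fn(model(images), labels)
#     loss.backward()
#     optimizer.step()
```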
And I guess it sounds like your ambition is to not make just like one drug, but to kind of build a process to make lots of drugs. That's right. And I would think with a hit rate, I just picture managing a business where the sort of hit rate of something is like one in 10, or one to 20 sounds incredibly stressful. Is that the case? It is incredibly Stressful. Especially when each experiment cost you tens, or maybe hundreds of millions of dollars, at least today. So how do you navigate that is certainly something we think about a lot. How do you make the process faster? How do you make it less expensive? How do you fail fast so that you don't end up spending the hundreds of millions of dollars on something that is going to fail? So how do you recognize earlier that something is the wrong path? That actually is the point of what the machine learning is looking to do. And how do you ensure that you have enough capital to give yourself multiple shots on goal, in case the first couple don't work out. Right. Right. Although, you've done a good job with that. It looks- Oh, yeah. I can't complain. Well, I guess I want to make sure I also ask you about some of your other work. And I wanted to ask you about Coursera and I guess teaching in general. I think you're not teaching anymore. Is that right? No, I'm no longer a professor at Stanford. I'm an adjunct and it's great to have some connection back to the department, but I don't teach anymore. It seems sad to me because, I mean, I just wanted to say you were such an amazing teacher. Like you were- Aw, thank you. ... notoriously difficult teacher. That was kind of your reputation. And you weren't kind of the warmest teacher, but you're like memorable, 16 or 17 years later. It's just like a really, really excellent teacher. Like I feel like I just learned very quickly and efficiently from you. And then also when I TA'd for you, I got to see how much you cared about grading, which I really appreciate. It's interesting to see. I was coming from a math department too, where it's like, they just did not care about teaching or grading. And it felt just really good. It's like someone's here and really kind of cares to take the time. And so I kind of, wasn't surprised that you started a company around teaching, but I was kind of just curious to hear the story about it and how you thought about it. And what happened in the early days. So teaching had always been a passion project of mine, kind of like on the side. Because, as someone who's on like the research side of a top academic institution, top research institution like Stanford, you're not supposed to really invest a lot of time in teaching. So I was always a little bit of an outlier in wanting to spend time on that. Can I ask, what do you think that was, that made you want to do it? Because it really was quite evident that you cared more than anyone else about teaching. I guess I've always thought that education was just the door to opportunity. And that if you set someone on the right path at an early age, or rather you enable them to get on the right path at a relatively early age. Because I mean, teaching is not really a thing. A teacher enables people to learn and become who they can be. And they have to make the investment and want it. You can't learn someone, they have to learn. I just felt like it was an incredible enabler. I'd come from a family where I was privileged in that both of my parents had access to higher education. 
And I saw how much opportunity that created for me that others just didn't have. And I guess I've always felt and still feel, and really try and teach my children as well, that for those of us who have been privileged so much is expected. And it's our responsibility to give something back. That was, at that point, my way of giving something back is by teaching. And in fact, that was what led me to also eventually depart Stanford. Because I felt like my opportunity to give something back to the world in a much greater scale was available to me by founding Coursera. And opening up education to a much, much, much larger number than I would ever be able to teach at Stanford. And that's actually also what led me to insitro, because I feel like there's an incredible moment in time now in bringing together two disciplines in a way that could be totally transformative to the world. And I think it's kind of incumbent upon me. There's almost a moral imperative to make that happen if I can do that. And it's not something that many other people can do. And I guess, I saw you started another company, Engageli that's seems like a teaching tool. Right? And was that a reaction to something you wished Coursera did? Or? Yeah. So yes and no, in the sense that it was driven by the observations that we had in the pandemic, when all of a sudden I had two teenage kids who were thrust into Zoom school. And these are two kids that are academic high performers, that are by and large, pretty diligent. And, at some point I was kind of looking in on them and noticing that the youngest, after a few minutes in her class, making sure that the teacher saw that she was there, would turn off the camera and the microphone and spend the rest of the class perfecting her Sims game. Whereas the older one would spend the time going through the Netflix catalog. And this is like, okay, if this is what my kids are doing, despite the fact that they have all these opportunities, what happens to all those other kids who don't have that same set of privileges. And they're going to a school with much larger classes and teachers who have way less time to invest in trying to make the classes better on video. So that was really part of it. But truthfully, and this comes back to, I think, the thrust of your question Luke, is that originally when I was getting interested at Stanford in teaching, it was actually not originally with the only purpose of teaching the world. But also in trying to get teaching to be better even at Stanford. Because I felt like, okay, I got to spend, whatever, three hours a week with people like you in a class. And we were making use of that time with me just standing in front of the class, droning at you, and delivering a lecture that was not that different to what I delivered a year before. Is that really the best use of class time? Or can we spend the time actually engaging and interacting with each other, and really learning? Which is much more of an active effort than it is just sitting there watching a professor talk at you. And so this really was, to me, coming back to what had motivated me to go into a lot of capabilities that ultimately went on to become what we built in Coursera. And really create a tool by which people can learn together, even if they are not physically co-located. And what we've discovered is that the move online actually makes things better, irrespective of whether you're in the same classroom or not. 
Just because of the ability to flexibly chat with people who are in a group with you, work together as a team. And really create an environment that fosters active learning in a way that is very hard to do. If you just have a bunch of people sitting in a large auditorium with not great acoustics, all facing forward in fixed seating, but with a tiered classroom, looking at the instructor down below. So I think I'm hopeful that one of the few benefits of this terrible pandemic that we're suffering through is that we will not actually go back to teaching the way we did before the pandemic. But we'll have a better way of teaching. Interesting. Yeah, I'm remembering now that I think one of the things he did really well, I thought, in a in-person class was actually kind of watching when you were losing the class and then pacing. I remember you had this trick where you would ask, who does understand what I'm saying, which everyone should do. It's funny, I've taken that with me for the rest of my life in talks and stuff. I really appreciate it. But everyone should do it. Because a lot of times you would ask that and it'd be like a third of the people that raised their hand. And it was actually even helpful for me to know, as like a nervous student, that I'm not the only one who's kind of like lost track of where this is going. Yeah, a lot of people ask the opposite question, which is, who's not with me? It's like, ""Well, most people haven't even absorbed the question and you've already moved on."" Or, ""Does anyone have questions?"" And it's like, ""Well, I don't even know if I have a question because I haven't understood what you're saying."" So I think it's really important to create an atmosphere where the default is I'm not understanding rather the default is I am understanding, and especially when you are teaching complex material. Right, right. Another question I wanted to ask about, I remember years ago when I was your student, you were super interested in probabilistic graphical models, which were really interesting. I remember especially being interested and they've sort of... The thing that stuck with them is sort of causality. It seems like you can find that in data, which is really cool and surprising. But I was curious, have you maintained an interest in that? Has that field evolved in interesting ways? What's happened with them? I don't hear about them as much. Well, I mean, I think there's been a lot of discussions in the last few years about deep learning because of all of the big transformative things that deep learning has been able to do because we've been able to get away from feature engineering, which has been such a pain point in most of the tasks that we deal with. I think there is clearly still very much a need for understanding causality. When I think about the work that we're doing in drug discovery, the fundamental question that we're asking is, if I make this intervention in a human is it going to make a clinical difference? Is it going to benefit the human? That is an interventional question. If you confuse that question with an observational question, you very easily, immediately fall into all sorts of traps about correlation being different from causation and a lot of the correlations being completely going in the wrong direction from a causal perspective. So you find yourself intervening in symptoms or just downstream sequela that have nothing to do with the fundamental disease processes. 
I think even in machine learning more broadly, there's a growing recognition that that is one of the big unsolved problems in getting machine learning to go to the next level. I was at the NeurIPS conference, not this past year, but pre-pandemic, and Yoshua Bengio was giving a keynote talk. And he highlighted that as one of the main unsolved problems, both because of its intrinsic importance, but also because understanding causality and the causal processes that underlie the world enables you to learn with much sparser data because you have a much more structured representation. So I think that what's likely to happen is that the pendulum has swung very much towards the deep learning side of the world, as it should because of the tremendous advantages, but I think it's now starting to coalesce. These two paths are starting to coalesce. We're going to see a lot of interesting work coming out on that front. Cool. Thanks. So we always end with two questions and I want to make sure you have a little bit of time for them. So the penultimate question is, what's an underrated topic in machine learning? Maybe I'll say it to you like this, if you had more time on your hands, what new thing would you investigate or look into? So I'm going to use this opportunity to give you two answers. One of which is maybe more directly to your question and the other one, which we didn't get to earlier, which is why I'm doing what I'm doing right now. So I think on the pure machine learning front, what we discussed earlier is really a fundamental problem, which is, how do we leverage large amounts of weakly supervised, unsupervised data to learn a representation that enables us to then very efficiently learn from much smaller data sets? I think that's an area where, yeah, people have said, ""Well, there's, whatever, the image representation that we learned in ResNet and of course, word2vec,"" and there are others, but I don't think we've really sort of pushed this to the limits in terms of how do you bring these different types of data sets together? What's the right way of combining the objective functions in a way that balances things in the right way? So I think that's an area where there's going to be a lot of interesting progress on how you learn and refine a representation over time. If I broaden this question out from machine learning specifically and ask where I think there is a really big opportunity for the world, it's in this convergence of these two disciplines, which is biology and data science and maybe engineering. So maybe it's three disciplines. The analogy that I use here is if you look at the history of science, there have been sort of epochs in history where one field has just sort of really taken off and made a tremendous impact on the world in a relatively short amount of time. In the late 1800s that was chemistry with the periodic table, and then in the early 1900s it was physics with understanding the connection between matter and energy and between space and time. And then in the 1950s, it was computing and the ability to use silicon chips as a way of really doing calculations that up until that point maybe not even a person could do. And then in the 1990s and 2000s, there was a bifurcation. There was data as a field, which emerged from computing, but also from optimization and statistics and neuroscience, and I think it's really its own field.
The other is what I call quantitative biology, which started to measure, finally, a very robust and reproducible and quantitative way aspects of biological systems. That's what gave us sequencing and microscopy and all of the things that I talked about before. And I think the next big field that's going to emerge is the convergence of those two fields into one, and I'm calling it digital biology. To me, it's the ability to measure biology with fidelity in its scale, use machine learning and data science to interpret the measurements that we get, and then use bioengineering techniques to go back and intervene in biology to get it to do something that it wouldn't otherwise do. That has implications in human health, but it also has implications in bio materials, and in agricultural technology, and in environmental science, and in energy science. All of these are places where the convergence of those two fields and this digital biology is just going to transform that space. I think that's going to be the next big field of the next, whatever, epoch of science. Wow. Well said. Let's go to the highlight reel. It's actually a good segue to our final question, which is, I would say this for insitro, you're trying to discover new drugs using machine learning. What are the practical day-to-day challenges right now of making that work? Well, so I think there are a number so I'm going to highlight two. One is that biology is really hard. You are dealing with live things and they're variable, and they depend on the exact temperature in the room, and on who the tech is that's manipulating them, and a lot of things that you don't normally think about and we don't have to deal with in a lot of the more exact sciences. So how do you create datasets that are robust enough and experimental procedures that are robust enough so that the noise does not overwhelm the signal and the variability does not overwhelm the signal? The second is that in order to do the kind of work that we're doing, you need to create a really unique culture of individuals who are able to sort of speak both languages, at least to a certain extent, and communicate with people with a discipline very different to their own. That's something that we don't have to do quite as much in many other applications of machine learning. So if you're doing machine learning for web recommendations, you don't need to deeply understand the catalog of items on the Amazon site in order to write the recommendation algorithm. That's not true for biology. You really need to understand enough that you can have a meaningful conversation with a biologist to our chemist. So the recruiting of people who either have that joint skillset or are willing to learn enough to have a meaningful dialogue and really work as part of a truly cross-functional team with people from the other disciplines... We don't train enough people like that. I think building the company with that kind of individual and with the right culture is something that I think about all the time. I think we've done a really great job of it at insitro so far, but it's definitely an ongoing effort all of the time. Awesome. Thank you so much. Great. Thank you. Doing these interviews are a lot of fun. And the thing that I really want from these interviews is more people get to listen to them, and the easy way to get more people to listen to them is to give us a review that other people can see. 
So if you enjoyed this and you want to help us out a little bit, I would absolutely love it if you gave us a review. Thanks.",7624 +Piero Molino — The Secret Behind Building Successful Open Source Projects,https://www.youtube.com/watch?v=iSivXjQWg_c,2181,2021-02-11,"So this model was predicting the classification of the tickets, and then we decided to build a model that was also suggesting which actions to take in response to this ticket and then there was also another model that was deciding which template answer to send back to the user, depending on what they were telling us and so instead of creating all these different models, I found that, that was a really nice application of multitask learning and so made it so that you can specify multiple outputs of multiple different data types and in the end, we had basically one model that was capable of doing all these tasks, using all these features and that was basically the base of Ludwig and then I started to add also images and all other things on top that and more people started to use it. You're listening to Gradient Dissent, a show about machine learning in the real world and I'm your host, Lukas Biewald. Piero is a staff research scientist in the Hazy Research Group at Stanford University. He's a former founding member of Uber AI, where he created Ludwig, worked on applied projects and publish research on NLP. I'm super excited to talk to him today. All right. So Piero, I'd love to talk to you about your time at Uber and the things you worked on, but I think the thing you're maybe better known for and the main topic is probably your project Ludwig. So maybe for some of the people that might be listening or watching, can you just describe Ludwig at a high level? Sure. So it's actually a tool that I built when I was working at Uber, mostly for myself. I wanted to try to minimize the amount of work that it would take me to onboard a new machine learning project and what it resulted in is a tool that allows you to train and then deploy the deep learning models without having to write code, and it does so by allowing you to specify a declarative configuration of your model, and depending on the data types that you specify for the inputs and the outputs to your model, it assembles a different deep learning model that solves that specific task and then trains it for you and then you can deploy it. So can we make this more concrete? So, what if my inputs were bounding boxes, is that something that Ludwig would understand if those images in bounding boxes, it would then sort of choose a model and learn, say predicting classes or something like that, would that work? So it doesn't right now. There's no specific bounding boxes. It's something like a feature that they're going to add in the near future but what do you do in general is exactly that. So you specify your inputs and your outputs and you specify what are their type. So for instance, if you want to do image classification, then you can say your input is an image and your output is a class or if you want to do information extraction from text, then you can have text as input and for instance, a sequence as output where the sequence tells you what information you want to extract from the text and any combination of these inputs and outputs allow you to create a different model basically. And is the idea that underneath the hood, it picks the best state-of-the-art algorithm for any particular kind of input and output, is that right? So it works at three different levels, really. 
The basic level, you don't specify anything, you just specify your inputs and outputs and the types, and it uses some defaults that in most cases are pretty reasonable defaults, things that are, for those kinds of types of inputs and outputs, state-of-the-art in the literature, but you can also have... You have full control over all the details of the models that are being used. So for instance, if you're providing text, then you can specify whether you want to encode it using an RNN, or you want to encode it using a transformer or a CNN or a pre-trained model like BERT. You can choose among these options and you can also change all the different parameters of these options. For instance, for the RNN, you can say how many layers of RNN, or if you want to use an LSTM cell or a GRU cell, or the size of the hidden state, all the parameters that you may want to change for those models, you can change them. And additionally, one thing that we recently introduced in version 0.3 is the capability to do hyperparameter optimization, so that you can say, I want to use an RNN, but I don't know how many layers I want to use, and then you can say, I have this range between one and 10, and figure out which is the best parameter configuration for this problem. And what does it do underneath the hood? Does it have some kind of smart system for finding the best set of hyperparameters? Yeah. So first of all, the models that it trains are TensorFlow 2 models right now, but we're also thinking about adding additional back-ends, but that's what... So the output in the end will be a TensorFlow 2 model that you can use for whatever purpose you want. And for the hyperparameter optimization, for the optimization process itself there's also a declarative configuration you can give, and you can specify if you want to optimize it using different algorithms. At the moment, there are only three supported, which are grid search, random search and a Bayesian optimization algorithm called pySOT. In the near future we're going to add more. In particular, we want to integrate with Ray Tune, as many of those algorithms are already ready to be used. And also you can specify where you want to execute the hyperparameter optimization. If you have a laptop, maybe you want to execute it just on your machine, or if you have a machine with a GPU, you may want to exploit the GPU, or if you have multiprocessing and multiple GPUs, you can run the training in parallel, and also if you have access to a cluster, then you can run on the cluster, a Kubernetes cluster with multiple machines with multiple GPUs. Does Ludwig include data preparation or data augmentation techniques? Is that something you can do with it also, because I know that's super important to many fields these days? Yeah. So for data pre-processing, there are a bunch of things that Ludwig provides and a bunch of things that it doesn't provide. In particular, because that's not a hundred percent the main focus, at least so far has not been a hundred percent the focus of the library. So we provide some relatively basic functionalities, and if you have some specific need for pre-processing, we would suggest doing it beforehand, before providing the data to Ludwig. But things that Ludwig does automatically are, for instance, normalization of features and some tokenization of sequence or text features. 
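To make the declarative configuration described above a bit more concrete, here is a minimal sketch, assuming Ludwig's Python API (ludwig.api.LudwigModel). The feature names and the hyperparameter-search keys are illustrative assumptions, and exact configuration keys vary between Ludwig versions.

```python
# A minimal sketch of a Ludwig-style declarative configuration.
# Feature names and hyperopt keys are illustrative, not definitive;
# exact keys differ across Ludwig versions.
from ludwig.api import LudwigModel

config = {
    "input_features": [
        # Text input encoded with an RNN; the encoder and its parameters
        # (number of layers, cell type) can be overridden explicitly.
        {"name": "ticket_text", "type": "text",
         "encoder": "rnn", "num_layers": 2, "cell_type": "lstm"},
    ],
    "output_features": [
        # A single categorical output turns this into a text classifier.
        {"name": "ticket_class", "type": "category"},
    ],
    # Hypothetical hyperparameter search block: sweep the number of RNN
    # layers between 1 and 10, as in the range example from the conversation.
    "hyperopt": {
        "parameters": {
            "ticket_text.num_layers": {"space": "randint", "lower": 1, "upper": 10},
        },
        "goal": "minimize",
    },
}

model = LudwigModel(config)
# model.train(dataset="tickets.csv")  # training-call argument names vary by version
```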
For images, we do resizing, cropping, or pretty standard things, nothing crazy, but something that is useful for having a kind of end-to-end experience. In terms of augmentation, currently we don't have any augmentation that you can do right out of the box, but it's one of the things that we want to add in version 0.4 of the package. I think one of the things that's striking about your library is, I think some libraries try to help people who do write code do machine learning without a deep knowledge of machine learning, but I think your library, if I recall correctly, it says right at the top, ""We're trying to make it possible to do machine learning without actually writing any code at all."" So that seems like a grander ambition. Can you talk a little bit about what made you come to that and maybe what design decisions you make differently to try to enable that? Sure. So I think to a certain extent it's a little bit aspirational too, right, because there is still something that you have to provide, in this case the declarative definition of your model. But I believe that it's so much simpler to write this configuration file than it is to write code, that for many intents and purposes it actually opens up the possibility for more people to try out, to use these models. So that was, to a certain extent, the intent. In terms of the design decisions, I think that the main one that allows for this level of abstraction is probably the choice that I made to be, as you were saying before, opinionated about the structure of the models, and the fact that there are some data types that I support and some data types that I don't support. If your problem is within the realm of those data types that I support, then I make it really easy for you. If it's outside, then well, either you can go and implement it yourself, or you can extend Ludwig to actually incorporate also additional data types that you care about. And those data types, the fact that you can compose the data types, so the compositionality aspect of it, is what makes it general to cover many different use cases, and that's probably the main, as they say, secret sauce, which is not so secret because it's an open source project, but it's probably part of where the magic is. Let's put it this way. Can you describe how you would compose a dataset? Can you give me a concrete example of that? A data type, sorry. Yeah. So again, we've been through some examples, like text input, category output is a text classifier. But the interesting thing is that, in some libraries, what you have is they provide you with some templates. Like for instance, Turi Create, which I believe allows you to create models for Apple devices, does something similar, where you have a task which is text classification, and then you have to provide the text input and the class output, and then there's another task that, again, gives you some templates that you have to fit into. In Ludwig, it works the other way around. You start from the data and you look at the data that you have. For instance, if you want to classify an article, maybe you don't have only the text. You also have information about who's the author, and you also have information about the date when it was published. Maybe there is a subtitle, and there's a separation between the title, the subtitle, and the body. 
What you could do with Ludwig easily, you can say, well, the title is a text input feature, but also the subtitle is a separate text input feature, and the body is a separate input feature. The author is a category, because maybe I'm working for a website and the website has 20 different authors, and information about the author will allow me to figure out... Because many authors maybe publish on a specific topic, and so that's additional signal that you will have when you're trying to figure out what class this news article belongs to. Also time, because maybe at a certain moment in time there was a spike of interest in a specific topic. Knowing that an article was published on a specific date helps you figure out what type of article this is. With Ludwig it's super easy to specify all these different inputs from different data types. It's just a list. You just say the name of your feature and the type, and it's a list of those things. That's all you have to do to have a model that combines all these different inputs into the same architecture, really. What do you do if the types of your data are inconsistent? Can Ludwig handle that? What do you mean by inconsistent here? What if my input data had... Missing values might be the simplest case, right? But I'm thinking of the cases that people come to me with and they want to do some classifications, some crazy data set. Maybe there's sometimes multiple authors, I'm just thinking of all these... Oh, I see what you mean. ...edge cases. How do you deal with that? I see, I see. Well, let's say for cleaning and missing values, Ludwig does some of it for you. You can specify a default fill-in value, or you can specify to fill with some statistics, like with the min, with the max, these kinds of things, which are pretty straightforward. Ludwig allows you to do all these things, so that's good. But if the inconsistencies are bigger, like for instance, in some cases there's multiple authors, well, you either treat it as a different data type altogether. For instance, set is a data type in Ludwig. If you have multiple authors, you can treat it as a set rather than treating it as a class, for instance, as like a category. Because I have multiple of those data types, like for instance date is a data type, geolocation is a data type and so on and so on, I think you will have a relatively easy time finding a data type that fits the type of data that you have. Again, if not, Ludwig is also easy to extend, to add a data type that matches your specific use case if you want to. Do you have examples of people that use Ludwig that really couldn't write any code? Do you know people that have tried that? Yeah. There is this really interesting example that I've witnessed, I would say. There are a couple articles online from a person who was an expert in SEO, search engine optimization, and they wrote a couple articles on an SEO blog about using Ludwig for doing some predictions that are specifically useful for SEO purposes. I believe most of these people don't have a programming background, they cannot code. It was really nice to see people using it for that purpose. And another fun example that I have. I don't know how much coding this guy knew, but okay. There was this application of Ludwig for... there's a published article by the Max Planck Institute on analysis of some biological images, I think it was about worms or cells of worms, I don't remember exactly. 
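As a rough illustration of the article-classification composition described above, the list of typed input features feeding one categorical output might look something like the following sketch; the feature names are hypothetical and exact type names may differ between Ludwig versions.

```python
# A sketch of the multi-input article classifier discussed above: several
# input features of different types combined into one model with a single
# categorical output. Feature names are hypothetical.
article_config = {
    "input_features": [
        {"name": "title",        "type": "text"},      # separate text feature
        {"name": "subtitle",     "type": "text"},      # separate text feature
        {"name": "body",         "type": "text"},      # separate text feature
        {"name": "author",       "type": "category"},  # one of ~20 authors
        {"name": "publish_date", "type": "date"},      # when it was published
    ],
    "output_features": [
        {"name": "topic", "type": "category"},         # the class to predict
    ],
}
```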
But the point was that the person that was using it was a biologist, was not a computer scientist. What he told me is that he would not have been able to implement ... He was using ResNets within Ludwig and would not have been able to implement a ResNet by himself. Ludwig enabled him to do this kind of research that otherwise would not have been easy for him to do. These are some examples of what you're talking about that I'm pretty proud of. Yeah. You should be proud of that. That's really impressive. Ludwig came out of though your use cases, and obviously you're a very skilled coder. What were you working on at the time at Uber that inspired you to make Ludwig? Again, the whole point is that I'm lazy and I don't want to do the same thing twice. Well, twice is fine. Three times I basically try to automate it for myself, for my own sake, right? Yeah. I was working on this project called COTA. There's a couple articles online if you're interested about it. It's a customer support model that basically at the beginning was we were treating the problem as a text classification problem. We had the input tickets and we wanted to predict what type of ticket this was, because depending on the type they were routed to different customer support representatives. And maybe just before you get too far into it, could you like describe what's the scenario, what's an example ticket and what would be an example class? Something like that. Yeah. I was working at Uber, so one example was the note, ""My ride was canceled. I want my money back,"" or something like that. The class, there were about I think 2000 different classes that the ticket could belong to, which could be appeasement request or lost item or food not delivered because there's also the Uber Eats side of things. Right? There was a really wide range of possible types of issues that could happen. Again, at the beginning we were treating it as a text classification problem, but then the PM working on this problem came to me and said, ""You know, there is availability for additional features here. Like for instance, we can access some features from the user that is sending this message, for instance, if they were using the Driver app or the Rider app or the Uber Eats app when they were sending this message."" That was again, additional signal that we wanted to integrate into the model. Well, I did it once and that was fine. But then they came back to me with additional features that were related for instance, to the ride that they were taking. I said, ""Okay, so these features, some of them are numbers. Some of them are binary values. Some of them are categories. Let's make it something generic so that if they come to me again with more features to add, it would be really easy for me to do that."" That's the path they covered for the inputs. Then the same up into the outputs because we had ... This model was predicting the classification of the tickets, right? And then we decided to build a model then was also suggesting which actions to take in response to this ticket. Then there was also another model that was deciding which template answer to send it back to the user, depending on what they were telling us. Instead of creating all these different models, I found that that was a really nice application of multitask learning, and so I made it so that you can specify multiple outputs of multiple different data types. In the end we had basically one model that was capable of doing all these tasks, using all these features. 
That was basically the base of Ludwig. Then I started to add also images and all other things on top of that, and more people started to use it within the organization. Then later on, we decided finally to release it as open source, because we thought that also other people outside Uber could find some value in using it. That's so cool. Do you anticipate more people moving to this model of not worrying about the underlying architecture of what's happening? What should people then focus on if they're using Ludwig? If you want to make your model better, what is there left to do? I think there are two aspects there. I would say, I believe, I may be wrong, but I believe that there are many more people in the world who don't know how to implement a deep learning model than people who know how to implement a deep learning model. Right? I would say I believe that there's also value that Ludwig can give to an expert in particular, because it makes it easy to compare different models, makes it very easy for you to have a baseline, for instance. That is definitely something that is useful in many situations, right? But if you are a super expert and you want to implement, if you're a researcher and you're creating a new model, then probably you want to implement it from scratch and have full control over it. But I think there's the rest of us, the rest of the people who don't know how to implement a deep learning model and don't have the time and the resources to study it. For those people, I think there's a lot of value to be unlocked by using a tool like Ludwig. In terms of then what do you do if you're not writing your model? Well, there are all sorts of other things, right? First of all, you can figure out the hyperparameters, both by hand and also automatically. Also there are other things. Like you can try to, for instance, figure out on which subsets of data the model performs better or worse. Have some sort of outer-loop kind of explainability, and then try to make sure that your model is safe and that it's not discriminating. All these sorts of things. It's usually the way you actually approach these kinds of problems. You need to add more data in a specific way that tries to address and solve these problems in the behavior of the model. Right? I would say in general, this is like a piece of a human-centered kind of process. The human has a lot of things to do in this process, by labeling data, adjusting the model, integrating the model into a broader application. There's a lot still to do for the human, I believe. Is it part of Ludwig's scope to guide the human building the model into things that are likely to help the model perform better? I'll give you an example. I often help people who don't have a lot of experience train models, and some of the mistakes they make are kind of surprising to people that are in the field, but make total sense if you step back. I've noticed, in some cases, people will have so many classes that they don't have an example, literally even one example, of every class, and then they're surprised when the model can't predict that class where they've literally not provided an example of that. And I can think of lots of different ways that people can shoot themselves in the foot when they don't have experience with this type of thing. Is it part of Ludwig's scope to help people avoid those bad situations? That's a really interesting question. 
I would say the scope is changing over time, to be honest, right? As I described at the beginning, the scope was to build a text classifier, and then it became a much more generic thing over time. So also with regards to what you're asking, it's something that we don't... So let's put it this way. Ludwig nudges you in a direction, but it does show, in particular, for model architecture choices and model training and building, it has some defaults that are kind of reasonable and helps you figure out easily with different parameters what to do. What it does not do right now is what you described, the more higher level kind of problems. Is the problem you're trying to solve a problem that is solvable with a machine learning algorithm to begin with, for instance, that's something that is right now out of the scope of Ludwig. You basically start with something that you believe could be useful, a signal that kind of makes sense and a distribution of classes, for instance, that kind of makes sense. This is slightly switching gears, but this has been a surprisingly interesting question recently. What do you think about Python as sort of a lingua franca of- What you're saying is very interesting because there could be some even relatively easy checks that one could do beforehand and return to the user saying, ""Oh, there are class A, B and C that don't have examples. Maybe you want to provide them if you want to have good performance,"" or something like that that could be easily added. So that's something that I will take into consideration. ... machine learning. Do you think that Python is going to stay the dominant language for people building models? Or maybe there'll be something even more high level if your vision is that people don't even need to write code to build these models. Yeah. Okay, there are several aspects of this question. I think also it depends on who is the user. I believe that, for instance, if you think about databases before SQL was invented, well, people had to code their own databases by hand. Well, not really SQL, but maybe relational database in general, introduction of those kinds of management systems. People had to implement their databases by hand, and they were using files and YARA keys as a way... The file system was basically an early example of a database, really. And then there was this change into the paradigm of the way that people interacted with data by using a language like SQL that is more declarative, doesn't require you to express how things should be computed, but actually what you want to compute. And I think that a similar shift could happen also for machine learning. Although, this is true for a set of users, which are the final users, those ones that use the models much less so for the people that actually produce the models. For the people that produce the model, I think... I actually love Python. I think it's a great language, has really nice syntax, is very simple to pick up, very simple to look at someone else's code and improve it and change it. So I think it's a great language, but I can also imagine that we could be moving towards languages that are probably a little bit more efficient. The efficiency of using Python right now is basically wrapping C stuff. Maybe there is a world where we start to write models in Rust. Even in Rust, it's a little bit too complicated probably. But I believe that they... Or maybe in Julia, I don't know. There could be some candidates language to dethrone Python as the lingua franca for machine learning. 
Although, I don't see that happening in the very near future, to be honest. How do you decide what default model you give someone for a certain configuration, especially when the research is changing so fast, and I would say especially maybe in natural language processing right now, which it sounds like is where Ludwig started? Does it ever get contentious to decide what default to put in? Because I would think that a lot of no code users, if they have no experience in machine learning, they're probably going to stick to the default, or at least even if they do a hyper parameter search, you have to constrain it somehow to some set of defaults. How do you think about that? This is a great point. Also, there are many aspects, in my opinion, that they're not... There are some researchers that are actually talking about these aspects, but they're not, let's say mainstream, in particular, in research. And those aspects are... Performance is one dimension that a potential user of a system like this may care about, but there are also other dimensions. There could be speed of inference or cost of training or length of training or carbon footprint of your models and so on, right? So it's really difficult to figure out a default that accommodates all these aspects, right? Basically, right now it's impossible. What I usually tend to do is to provide defaults that are on the, let's say less computational expensive side. So, for instance, I will not have as a default to use T5 as a model for encoding language just because the amount of users that could actually fine tune the T5 models of different model will be relatively small and also the degree of advantage that they would get over a smaller model that may be not as big as to justify the increase cost, in computational cost, right? So I try to balance towards the inexpensive, but leaving the option for the more expensive. So that's one thing I do. And on the other hand... This is something that I'm really interested in doing. I'm starting to do some little research around it. One thing that I want to do is I want to do a really large scale comparative study. This is actually a little bit more on what I do at Stanford more than what I do specifically for Ludwig, but I'm really curious in doing a large comparative study among the different models with different hyperparameter optimization values on different tasks. And maybe one interesting outcome of that could be something that looks like a recommender system that tells you, ""I have these new data sets with this amount of data of this data types. What model do you suggest me to use given these constraints?"" Because I think that the constraints are important. You may say, ""I want only to see models that will take less than 10 milliseconds to run inference on."" And so maybe they will rule out some of the more expensive, but also more effective models, right? So suggesting something that depends on the constraints I think would be really useful. Well, now that we have a weights and biases integration, we could give you the data of all the users that chose to make their projects open, and that might actually give you kind of real-world evaluation of the different things that work and don't work. It would be super cool to see if that was useful. Absolutely. This is something that you might... With your data you probably can already do, right? We could think about ways to collaborate on that, definitely. That sounds really fun. That'd be fun. 
Stepping back a little bit, one thing that I wanted to ask you is, I noticed that you've been doing NLP work for quite a long time. I think before Uber you were at a startup bought by Uber. And before that, I think you had your own startup doing natural language processing, so you've been doing it for over a decade. I'm kind of curious about the perspective of someone like you on kind of the new stuff that we're seeing. Do you feel like GPT-3 is a real step function change in the quality of NLP and kind of changes the possible applications, or was it sort of inevitable? How do you look at the field, and how do you feel the field has changed in the time that you've been working in it? Yeah. It is true I've been working for at least 10 years right now, basically, in the field, so I've seen quite a few waves. Tasks that were interesting 10 years ago are still interesting today, so there are many things that were unsolved back then and still unsolved right now. We made progress in terms of performance, but I would say the general framework for the problems and how we approach them hasn't changed a lot. We're using neural networks where before we were using SVMs, but overall there was not a huge change, in particular in the way things work in industry, really. But in particular, the capabilities for few-shot... Actually, the capabilities for interacting with the model itself through language that are shown by something like GPT-3, those changed kind of the paradigm of interaction with those systems. I'm not sure of the commercial usefulness and application of something like that, but what I'm sure of is, having a general system to which you could give a really small amount of examples, and then the system picks up on that and is able to perform the same kind of task that you've shown it on unseen data right off the bat, without needing specific training for solving those tasks, that's a very compelling thing and something that may bring the industry in a different direction, I believe. So, I see an interesting world in the future when that shift happens. Although, I still have my questions. We haven't settled on a final answer on how much and in which scenarios this actually works, to the point that we can actually use it. But let's see about that. I'm curious to see what the near future holds. Cool. Well, I can see we're running out of time, and we always end on two questions, and I want to give you a little bit of time to answer these questions. The penultimate question that we ask is, what is a topic in machine learning, broadly, that you think doesn't get as much attention as it deserves? So, I think now it's getting a little bit more attention than it was before, so I may be a little bit out of time giving this answer. But I believe that something that I think is very important is systematic generalization. And again, there has been work from Marco Baroni, Brenden Lake, and also Josh Tenenbaum on this topic, but it has not been at the forefront of research for a long time. But it's something that is super interesting, and it's something that, if solved, may unlock many applications of machine learning where now we have a hard time applying machine learning. For instance, in scenarios where there's a lot of shift in the distribution of data over time, or in scenarios where we need to train from less data. If we had a solution for systematic generalization, we would be able to apply machine learning models especially in these scenarios. 
So I'm really looking forward to more research on that topic. And could you define what systematic generalization means? Yeah. I may be butchering it a little bit, but at least the way I see it is, the fact that you have a model that can figure out a way to generalize beyond the training data obviously, but generalize in a way that is systematic. So, that learns that... I can give you a practical example of all the specific instances of a specific phenomenon, it behaves in the same way. It realizes that, for instance, if you're talking about text, that is invariant to the choice of entities or is invariant to the choice of some synonyms when it's returning its predictions. And I think it's really important because those models that exhibit a behavior like that are models that we can trust. Cool. Well, the final question is, and maybe you could really rely on your experience at Uber here, what's the hardest part about taking an ML project from conceiving of the idea to getting it deployed in production and doing something useful? Yeah, I think the answer to this is it changes a lot, depending on the type of ML organization that you're working in. Like if you're in a startup you can do things differently, if you're in a big organization it may be different. So I can speak, in particular, for the big organization kind of setting. I can say that, in particular for researchers, one thing that is difficult is then to put whatever you obtained in your research into production. And there's at least two sets of problems why that is difficult. One is a practical one, an engineering one. Usually the infrastructure for deployment is not the same that you use for training your models, and so there's a mismatch there that has to be filled. And also, maybe your models are a little bit slow for what are the needs for inference at scale. And so there needs to be some compromises there, and that's one of the problems. But the other problem, which in my opinion, it's more important. Because it's not a technical one, it's harder to solve, is a misalignment in the goals, really, of what the model should be doing. You may be optimizing your model with whatever metric that you care about. Let's say, for sensory loss, or maybe you have a ranking problem and you're optimizing for the mean reciprocal rank, or whatever other metric you're using for both optimization and evaluation. But in the end, in many real scenarios, those metric are just proxies for what you actually care about, and what you actually care about if you are doing, for instance, a recommender system is, you care about how many people are clicking on the items that you are suggesting. And maybe if there's a store, how many people are actually buying something. You may have the model that has 20% better MRR offline. You deploy it and people don't buy more, that's not the model that is going to be deployed. And so that's something that machine learning people usually don't think a lot about, and it's something that in my experience has been the main friction... There has been a friction point between developing something offline and then getting something deployed for real in front of the users. That makes sense. Well, thank so much, Piero. It's a real pleasure to talk to you. Yeah, thank you for the really interesting questions. It was really fun to chat with you, too. Yeah, thank you. Thanks for listening to another episode of Gradient Dissent. 
Doing these interviews are a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to these episodes. So if you wouldn't mind leaving a comment and telling me what you think or starting a conversation, that would make me inspired to do more of these episodes. And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",6367 +Rosanne Liu — Conducting Fundamental ML Research as a Nonprofit,https://www.youtube.com/watch?v=iMxZIeOK5a8,2950,2021-02-04,"If ML Collective is about one thing, it's about open collaboration. We want people to think that science can be associated with employment. You join a job, you do science, but science can also just be your thing, your gig. You're an artist, you can join a studio to become a senior artist in that studio, but you can also just do art on the side. And science is at the same time, a collective effort. You need collaborators, you need people to work together with you. For that to happen, if you're taking science as a gig, then you have to be able to work with other people. But then we don't have really a culture there yet. Sort of if you analyze all the papers out there, Google people are working with Google people, CMU people are working with CMU people. Not exactly, but there are clusters, right? Welcome to the Gradient Dissent Podcast. I'm here today with Lavanya, and we have as a guest, Rosanne Liu. I'll say, I am super excited to talk to Rosanne, I had heard of her good work at Uber for quite a while, but then she actually came by Weights & Biases to give an open talk about one of her research papers. I was just so impressed with the kind of creativity in the way that she analyzed the neural networks. And we'll get into that talk with her today, but Lavanya, you actually found recent stuff that's even kind of maybe more exciting that she's working on... She had founded this amazing organization called ML Collective, and she's trying to democratize AI to anyone. And she's trying to ensure, even if you're not tied to one of these super prestigious institutions, you can still publish really cool research. And that is just such an important thing to work on. I'm super excited to talk to her. Welcome to Gradient Dissent, I want to start with what made you find ML Collective. It's such an important organization, and as someone who cares about diversity and also democratizing AI, I am curious about your journey. Thank you, thank you for having me. And thanks for saying that it's an important organization. I don't think we're there yet, we're so new and young that it can really go either way at this point. One wrong decision we make, it can really turn it to be a not so good organization. But yeah, ML Collective is interestingly, it's like a million things in my head right now, because as someone who runs a company, I'm sure Lukas, you have the same feeling. It's like there's a narrative always in your head going like, ""What is this company I'm trying to do? What is this thing?"" And then you little by little add ideas to it, and then at this point it's just so many ideas all combine together, basically representing everything I want to do in my life. One research, I'm wanting to be a research lab that people just do research together. They should be people better than me, ideally. And there should be people better than me, but maybe less experience so I can offer help to them, so it can feel useful, there could be a wide range of people. 
There should be people who like having a home, I like having a lab feeling. There are people out there doing things better by themselves, but who are really trying to attract people that work better with people collectively. It's a research lab, it's also a nonprofit, because I want to do charity on to help people. And I think one dimension of science should be done within nonprofit. I mean, we don't see a lot of things out of science or in terms of ML research going on in nonprofits, there's mainly driven by industrial labs because they have all the resources. But thinking if we can set a one small example to show great research done through nonprofits, that'll be great. And people can really open their mind when they think about their career choices, there's one more avenue for them to choose. Research lab, non-profit is also just like a co-working space that people just come together when during their gap year, or when people feel like just want to dabble in science a little bit, or they're moving out of science, but they still want to get involved in... Want to see what's going on in ML. So it's just that, it's also something very personal, something that sort of changed my life. But we'll get to that maybe later actually. Let's go there. Tell us the story behind how you decided to create it. Interestingly, it's such a big change in my life, maybe the biggest change ever in my life. And now I just wonder if every business... Every change you see in someone else life must be propelled by like a misfortune, because that's what happened to me. It was spurred by a misfortune. Basically the whole narrative goes like this, I was looking for a job, I was out of a job in first place, and I was looking for a job. And there's no job really that offers the things I wanted, a working environment to have. So we decided to build our own, that's as easy as this, but you can imagine not being able to find a job, just feel like being getting rejected everywhere must be very heartbreaking. And that's what happened to me, and felt like everything went wrong during that period of time. Is a conceptual changing my mind that I feel like instead of changing myself... Because what the signal out there is telling me is that I'm not good enough to fit the higher rubric of those places. Instead of changing myself thinking I should be more and more like what they want, what I decided to do is just change the hiring system. Have a whole system of my own, where we got to hire people, or recruit people differently from how they do it. How does your organization compare to academic organizations? Because most academic institutions would be nonprofits too, right? What's different about what you're doing? They're actually both. Academic can be profit or nonprofit, they can be public and private. We are strictly nonprofit, so we are funded by donations basically, when we are funded, we're not funded now. Once we're funded, well, we will be funded by donations. Difference with academic it's not an employment based. Everyone joined us as a member, they can have their own employment, that really gives people flexibility. They can view this more like a hobby, which we found is actually motivates people more than when they viewed this as a job, like they have to report, they have to keep track on how much they're performing. They have to report to a manager and stuff like that. We function very much like academics, because the key people in the organization are sort of like PhD. 
So that's like the only way we know how to run a lab. We heard research media is like a lab, everyone gets updates. There's no graduation, that's one difference from academic. It's not like you have to join and then five or seven years later you graduate. It's really flexible. You get paid even less than academia, which is unfortunate. But until we got funded, we can adjust that. But now everything's volunteered. What are some of the challenges in building a sustainable nonprofit like the one that you're building? And not just monetary, but other challenges too. I haven't made it a sustainable yet, I should be asked this question like three years later. How did you make it sustainable? Or why did it fail? But I can see a little bit glimpse of hope there because we're starting to get donations and get funders, even without going into a full donation raising, we're already getting interest from big companies and personnel that want to contribute to the organization. I feel like it's a different ecosystem, right? Just this like profitable world, they have their own ecosystem. Startups, they raise donations, sorry, they raise investments. And they promise that there's such a return many years later, they give shares out. In this world, you raise donations because you're selling this concept, and you feel this concept is so important. It's going to have such a social impact that the donors or philanthropists, they really buy into this idea, they want to do something back to the world. There's like a whole different ecosystem going on there. Many nonprofits as far as I can see survive in that ecosystem. I feel like we probably can have a shot there. I haven't proved it yet, but that's my idea. Also a lot of donations don't come in monetary, as you said, a lot of people that are our current members, they're really just donating their time to work with us, right? They're helping others publish papers by donating their expertise at their time. That's actually way more valuable than money. And there are also donations of compute credits. AI research is expensive, we need all forms of donations, but when they all come together, I feel like there must be a model that is sustainable. I need to prove that, but I am hopeful. What kinds of research are you doing that you think might not happen somewhere else? That's actually the best part. As a nonprofit, you're not driven by goals or anything. It's not like you have to prove to someone that I have to make this object detection thing better than 98%, or something. It's really just curiosity driven, you can do anything that interests you. And also the management of ML Collective is not hierarchical, we don't have a central manager of any sort. You can start projects any way you want, and you will be the lead of the project as long as you're a member. It's really driven by individual members' interests. So that's the best thing. If you coming from a physics background, you maybe have something to do with thinking about how physics is related to neural networks, you can do a project at that. You can tell people about it, people who are interested will join the project. Then you formulate a team of your own, you will be the lead of the team of that project, it puts you through. Maybe the next project you want to join someone else's, because you want to learn something, you want to learn, I don't know like, how things work in brain. Enjoying a more like a neuroscience project where someone else is leading, you're more like a happy follower. 
That would work out. Your role would be very dynamic. Yeah, so the best, the answer is we're not limited to any specific topics, it's really driven by individual members interests. Could you tell us about some of the things you were working on? Yeah, I think most of the things are published, and we put it on website. Looking at my own profile I feel like I've always been someone in ML, but just dabbled around different topics. It's like maybe I'm not as patient as most scientists, just trying different things. And also neural network the whole machine learning is changing so fast that I don't think anyone can confidently say that the next year, this is going to be the biggest thing for any breakthrough from any small field would become the biggest thing, if more people spend time in it. In the past, I've done vision projects, NLP. I have recently had an LRP project. The very latest project is about continual learning, but that's also interesting concept that I like. We have projects about network pruning, just like whatever is going on out there, we take a look, take the recent paper, look at their code and try to implement it, run it and find false in it, or find things that we didn't understand and try to understand it. I feel like a lot of our listeners might be thinking, this sounds like a great idea, and a great place to get involved in. Do you have that in who is your ideal person to join ML Collective? Yeah, you can think of people as different categories, but of course every single person is always a combination of different qualities. But you can think of people who are really like the lead experts in those subfield. They really want to push that sub field forward, but they're lacking resources in terms of like people, maybe they are a really senior researcher, but their job doesn't give them reports. And for whatever reason they're not managing people yet, but they want to have their ideas executed, they want to influence people. That kind of people can join as sort of a thought leader, they can lead a project and other people can join sort of workout research project that way. And there are people who are having free time, and want to run code, want to sort of follow all those projects that are led by those experts, those people are also welcome. But we're mostly welcoming people that are not offered the opportunity at the big industrial labs, because those people, if they can get into industrial lab, they probably wouldn't be interested in us anyways. But also they're not the people that we're trying to attract, because all we want to do is to serve as a diversifier from the industrial labs from academia. The people that can not be hired easily by them, you can think of what kind of people they are, probably don't have a PhD, right? That's one of the biggest reasons that they don't have a resume that looks immediately hireable. They started coding since they are a teenager, but then they never really pursue a higher degree that way. But it doesn't say that they're not a good researcher at all. Probably people who are changing fields, they have a PhD, but in seeing something else, they were trying to get into ML, and it's much, much harder for them to just immediately get a researcher's job in those labs, so we also accept those kinds of people. Basically, we've tried to serve as a diversifier, anyone who having a difficulty getting into those places, but still want to do science like if they were in those places, then we work on those people. Can I ask another one before? Yeah. 
You talked about diversity, which we care about a lot as well. And I feel like every company says they care about diversity, right? But what's one concrete thing that doesn't require a lot of resources that any company can do to get more diverse talent in? That's a really good point. There are people in the company who care about diversity, but then when it comes to what their main goal is, because if the company is making profits, their main goal is still to make the company run. Diversity becomes the secondary thing or even third thing that they care about. And that's when things break down, because if you're going really just for productivity, for the speed of producing the next paper, then you wouldn't care about diversity, you would just hire the person who can produce a paper the fastest. You really need organizations or institutions that put diversity as the first... first-class citizen or whatever, that's our first goal. Nonprofits are sort of the thing that I'm thinking about, because they are not going after profits, and not going after productivity. They're not trying to submit to every conference because they want to show status; their whole job is to help people, to level the playing field for people. That's the way I'm thinking about it. Well, I guess I would love to hear more about some of the research that you're doing currently. I remember looking at your work on loss change allocation to try to understand what neural networks are doing, and I think that was such a cool idea. I wonder if you've had any chance to follow up on that one? I remember giving the talk at Weights & Biases, and you asking very great questions. Back then I didn't know that you were running Weights & Biases. It was like, that person has great questions, but I didn't know that you were the founder. Wow, I'm so touched, I appreciate that! I was just impressed by your questions. Yes, that work is LCA, Loss Change Allocation, that was published at NeurIPS, sorry, ICLR... no, NeurIPS 2019, a year ago. It was led by Janice, our resident back then when we were at Uber. Basically the idea is that we can break down the loss change onto each parameter. You can clearly visualize and see how much each parameter is contributing to the training of networks. Just in the training sense, we're not talking about validation or generalization yet. And surprisingly you see that half of the parameters hurt the training, probably understandably because everything's stochastic. We add noise into the process, because we use stochastic gradient descent, we use mini batches, and we sort of approximate the optimization as linear. All those are contributing to the noise in the process, but still, the amount of noise in the training process was surprising to us. The whole work was basically visualizing all those... How many parameters? What percentage of parameters was hurting? And then we broke it down into each layer, and we found that some layers hurt training more than other layers, especially the last layer. Actually a very easy follow-up work would be... We proposed that the last layer should use a different momentum term, and we did a small experiment there, and it improved things. I don't know if anyone training networks from then on was using a different momentum term for the last layer, but they should. And is this basically in every single step, or is it over a larger period that you see half of the parameters hurting? 
At any given time, there's over half of the parameters that are hurting, and then across the whole training, half of the parameters hurt overall, if you accumulate all the contributions together, which, when you add them together, give exactly the change in training loss from the beginning of training to the end of training. At any moment there's half of the parameters hurting, and then throughout the whole training it's also over half. And also if you track one single parameter, the thing is that it hurts half of the time. It's not really like we can catch this criminal and then just ban it from making changes to the loss, because they also jump around between the hurting camp and the happy camp. I mean, it doesn't seem surprising that some of... And by hurting you mean the parameters change in a way that makes the loss worse, right? Do I have that right? Yes, makes the loss go higher. And so it's funny, it makes sense that in a stochastic process, some would be making it worse, but it seems so surprising that half, right? Because overall the loss does improve over the steps. Yeah, exactly. What's going on there? Things that we didn't understand, like many things in neural networks that we didn't... We sort of get the idea, but then until we see the data, we're like, ""Yeah, that makes sense, but still is surprising."" Many papers are like that, and those are the papers that I aspire to write. Those where you sort of have an intuitive sense that this is something that's going on, but until you see the data, you're still surprised by the amount of it, or the actual extent of it. I don't think we fully understand it yet. That's so cool. I wish you had a chance to follow up on that. Running an organization, do you find yourself... Maybe this is asking for a friend kind of question, but do you find yourself spending a lot of your time in more kind of administrative tasks, and recruiting, and things like that, than actually doing research? Yeah, exactly. I start to doubt whether I'm still a researcher, because every day I look at my time, I'm like spending half a day designing the logo because we need to have a logo. And just, like, no one else is working on it. And then the other day, because we were organizing an event, I'm just spending all my night designing a Gather.town layout. I'm like making houses, making rooms, making sure people can go to different rooms and things. There are so many administrative things, but that's also one of my goals. Honestly, the next 10 years I feel like publishing papers won't give you as much value as before, because there are so many people trying to publish papers. What would give me more reward is actually helping people publish papers, and a concrete goal of mine is actually just to end up in people's papers' acknowledgement sections. That's all my goal. I'm not trying to be a co-author anymore, just because, I don't know. I don't think it's a field that I still want to... I want to be close to publishing of course, I want to publish as much as I can, but I also want to remind everyone that the publishing scene is going to be very different the next decade, just because you see this huge influx of people coming in and trying to publish papers. Almost every idea has been chewed over a thousand times, it's so hard to come up with an idea and then, as a researcher, find out, ""Oh, no one has ever done that."" This is impossible. Someone is doing that somewhere, which is to say that researchers right now... 
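For reference, the loss change allocation idea discussed above can be sketched as a first-order decomposition: each parameter's contribution at a training step is its gradient times its parameter change, and positive contributions are the parameters hurting training at that step. This is a simplified sketch of the idea, not the exact estimator used in the paper, and the synthetic numbers below are purely illustrative.

```python
import numpy as np

def lca_step(grad, theta_before, theta_after):
    """First-order loss change allocation for one training step.

    grad:         gradient of the training loss at theta_before (flattened)
    theta_before: parameter vector before the update
    theta_after:  parameter vector after the update

    The returned vector sums (approximately) to the total loss change for the
    step; entries with a positive sign are parameters that 'hurt' training.
    """
    return grad * (theta_after - theta_before)

# Example on synthetic data: fraction of parameters hurting at a single step.
rng = np.random.default_rng(0)
g = rng.normal(size=1000)
before = rng.normal(size=1000)
after = before - 0.01 * g + 0.01 * rng.normal(size=1000)  # noisy SGD-like update
contrib = lca_step(g, before, after)
print("fraction hurting:", np.mean(contrib > 0))
```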
ML researchers right now are having a hard time, if I can help them a little bit, how to help their paper improve or be different, and make the success rate of their paper getting noticed or published slightly higher, then I will be very happy. I guess, what general advice would you have then to someone that's trying to get something published? What are the mistakes that you see first time people or outsiders make and what kind of help do you typically give to someone? I feel like the things that our reward function is delayed, right? We go into ML research, liking it because we saw other people in research maybe like few years before us, and they gained reward out of that. They publish a paper and it was so recognized and they have such a fame and recognition and everything. So we want to do the same thing, but the difference is, we live in a delayed time line, when we get into it, the scene already changed, but we don't know. I really want to remind everyone that if you are getting into ML research now, publishing is very different than before. Before if you have an accepted paper at I don't know ICLR or NeurIPS or CVPR. You're basically, you're there, you can probably get a job that you would want. Get a dream job, get a position of something but not anymore. Now I think the next people will be looking at citations, even if you get a lot of paper published in peer review conferences, people will look at different metrics now, because there are so many papers getting in, and so many people having their papers getting in. The basic suggestion or advice is that you should try to adjust your reward system to be different from why you came in there, if that makes sense. I mean, it just seems like you should just make things even harder for yourself, right? You can't just publish a paper and have to get citations. Is that a good summary? No, that's why you should be looking at other things. You should be really just looking at love of science. I want to do this for the love of science, I'm not trying to... I do this piece of work and not to... Well, if it gets published, that's a confirmation that is a good science. But the basic thing that's important is that it's a good piece of science, I think that's what I want to say. You can do a beautiful work, put in an archive, don't worry about whether it gets accepted or not, because there are so many noise in that whole thing, the same as neural network training. There are so many statistics that the same paper was not changed or submitted to three conferences got rejected, rejected, accepted, because it's just random chance. Every time you're just drawing a lottery ticket or some sort. Don't care about that, don't care about really this true acceptance or not into a conference, really care about the quality of the signs you put out there. Because if it's on archive, you have your name on it, it's going to... That means something. Change your reward system to really care about the true quality of science, and remind yourself that you are in here for the love of science, not for... Of course some people are in here for it too, so that it promises a better future, and there's nothing wrong with that. But those will probably stray you a little bit away from the path, and maybe makes you a little bit miserable than you already are. What's the key to doing good science as an outsider? How do you do that? 
That's actually the idea of running ML Collective. I feel like there are so many problems these days in a world where people don't believe in science, right? I'm not saying ML Collective is the way to change that, but I sometimes think if you can get everyone, not even everyone, like the majority of Americans, to publish one paper in their life, maybe they'll just believe in science more. Once they go through that publication process, they see, ""Oh, to put this statement out, I need to try everything around it, do ablation studies, compare with all the benchmarks."" They will become more careful when they put statements out. I don't know, this is a weird argument I'm making, but I feel like if I can get more people to do science, not for life, just to publish one paper in their life, then everyone's attitude towards science will be better, they will believe in it more. We probably wouldn't have all those problems out there in America where people don't believe in science and other things. I don't know, that's my dream, of course. That's a great idea. Also I want to address the other end of the spectrum, which is all of these people who are trying to keep up with all of the papers that are coming out. And maybe you can use this opportunity to talk about this amazing paper reading group that you've been doing for like three years now. What's your advice for people who want to keep up, and what kinds of papers should they pick, and how should they go about reading them? There's no better way, actually, because I think this is our first time facing this problem, so there are no historical lessons we can learn from, this huge influx of papers. For now, I still trust the ones that are published at peer-reviewed conferences. We know that there's a lot of noise in there, but I trust those slightly more than papers that are just put up on arXiv. I sort of have a general sense. There are many people like me out there running paper clubs or YouTube channels. They dissect papers, and each of them of course has their own criteria for judging papers. But if you accumulate more of them, like an average of them, I think it's representative of the overall quality. A shameless plug: I think by now, I'm a good discriminator in all the sub-fields of ML. By being a good discriminator, I mean I can sort of judge what's a good paper and what is bad. I might not be a good generator in the sub-fields I've never published in, but being a good discriminator is the first step to feeling like I can run those things. You can sort of trust the papers they selected, but then you have to remember to combine them with all the other people's selections, for a more balanced view. Can you give us a little window into your process for being a good discriminator of high quality papers? I just read a lot. There are some basic elements I feel like a lot of papers are missing. Maybe it's people coming into ML research from different fields, or from a non-researcher background, which again is why I feel like ML Collective is important. Get people into this paper publishing process, and tell them the basic things we have to do: compare with the baselines out there, try different variations of the method that you're proposing, that's the ablation study. I see so many papers out there that have a huge diagram, right? A signal goes in, and then there are so many branches of things, and it branches out.
And then this is the end result, and they say that this whole system works much better than existing systems. But that's not science, that's good engineering. Great that you made it work, but what does it teach us? Right? Is this branch more important than that branch? Why did you branch out this way rather than that way? A really good piece of science should be, I think, inspirational rather than intimidating. That huge diagram is just intimidating: I built this huge thing, it worked, I'm not going to tell you how, because I hacked it together and it worked. Maybe there's scientific value in it, but to be a good scientific article, you have to tell us what things you have tried. Why this branch and that branch? Did you do an ablation study? Did you try to turn off this branch? What was the thought process behind it? How does your work inspire other work, maybe in different fields, to borrow the same thought process to produce science in their subfield? Do you have a favorite paper from the last few years that kind of exemplifies this, a simple difference and then a clear insight? Yeah, there are many amazing papers out there. Am I allowed to say my own? Absolutely, absolutely, yeah, tell us, which is the paper that you're most proud of? Actually I really like an early paper of ours called Intrinsic Dimension. It's from many years ago, not many, many, but in machine learning it feels like many years ago. It was published at NeurIPS, sorry, ICLR 2018. Intrinsic dimension basically means you take all the parameters... 2018? That's like two years ago! I know, but it feels like forever, right? That's amazing. That's how it is now, when you look at papers you're like, ""This is from 2018, there are probably better ones than this by now, probably I shouldn't be reading this paper at all."" But yeah, it's only... What is that? I think that's only two years ago. That paper has to do with measuring a basic property of a neural network. A neural network has so many things associated with it. There are parameters, there's a large parameter count. If you imagine putting every parameter together into a big vector, it's just a super long, long vector. And then you reduce it to a shorter vector and you only train the shorter vector. And how do you map from the shorter to the bigger? It's just through a matrix, a linear mapping back to the big vector. Basically you're saying that even though this network has 10 million parameters, maybe the number of dimensions in which you can make changes is much smaller than that big number. And there's a number out there that's much smaller, that says something about your network combined with your problem, combined with your data. That's how easy or hard this network, combined with the data and the problem, is. So that's- Wait, sorry, you can actually do that kind of... Because that's going to be a lossy compression, right? You can actually do that, make it much smaller without hurting the performance? Mm-hmm (affirmative). Well, I think now it's not surprising, because now you can prune. Pruning is an axis-aligned reduction, right? You reduce a big vector to a smaller one by basically masking some entries to zero. But back then we were just doing a linear projection. You can totally do it, because a lot of parameters in your neural networks are redundant. Not that they're not useful, well, LCA also teaches us that.
Not that they're not useful, it's just that they provide a better or different loss landscape for you to train on, but you can definitely train within a much smaller landscape. Well, if you think about it, this huge landscape that all the parameters help construct leads to an end point where there's a better loss. If you can draw a line from the starting point to the end point, that's just one dimension. If you could just travel along that line, that would be an intrinsic dimension of one. Any network would have a dimension of one that is trainable, but that one is very hard to find. It's almost singular. The intrinsic dimension is saying: this many dimensions, however you draw the line or the plane, should still lead you to a good enough solution. Wait, but how could you take a... Oh, because you can pick the linear function that goes from your sort of simplified representation to the more complicated representation. Yes. The thing is, if you were allowed to pick the linear function, you could reduce the dimension to whatever you want, all the way down to one. But that's not what we want to measure, because that would just be one for every network. What does that tell us? So the thing is, we make the projection matrix random, and then we measure how big it has to be. Because you know that in a very, very lucky scenario, this could be down to one. With that knowledge, you should know that by just randomizing, there should be a number that's larger than one, but smaller than the super big vector you started with. I see. And so how much smaller can you go? And is it like suddenly there's a drop-off at a certain size, or is it sort of a smooth deterioration of performance? Back then, basically you had to try every number. It's more of a scientific investigation, it's not something that can help you train the network faster, because basically you have to try every number from big to small, small to big, until it crosses the threshold, whatever threshold you want it to be. We picked a threshold that is 90% of the full network performance, or you can do 99, you can do 85. It's up to you. Pick that number, and that number is interesting. I want to make sure that I remember the number right, so I probably shouldn't... But for MNIST plus an FC network, I think it's 750, which is much lower than 784, which is just the input dimension of MNIST digits. That makes sense because there are many black pixels in MNIST images, but that number is very interesting. And then for CIFAR, I think it's like 19,000. That sort of gives you a sense of how CIFAR is harder than MNIST, but how much harder? Probably 10 times harder. But this is also- Sorry. 19,000 is probably the- Sorry, are you modifying the network or are you modifying the input? You're modifying the training procedure. Once you pick a network, you pick a task, the data is there, the network is there, the initialization is there, then your loss landscape is fixed. Now you're modifying the training procedure to let the point move, not in any direction, but in a restricted plane. You can think of it that way. So you're modifying the training procedure. Does the training procedure mean you're first modifying the input data, sort of shrinking it before you put it into the network? I see, you're only allowed to change this smaller set of numbers, and that changes the network through a linear transformation that changes the parameters. But how can you compare, like, MNIST and CIFAR? Wouldn't it also matter what network was being used?
Wouldn't a bigger network maybe have a different... Mm-hmm (affirmative), exactly, that's very true. But what we found, interestingly, is that at least at the scale of our experiments with MNIST plus an FC network, a fully connected network, you can make the network bigger, wider, and the number's roughly the same. 750 was the number we got from the MNIST plus FC type of network. Of course, if you make it huge, probably the number would change, but to the extent that we varied the size, it's sort of stable, which gives us confidence that it's a stable measure. But then if you change to a convolutional network, the number drastically reduces, as you can imagine, because convolution gives you a much better landscape. It gives you a much lower intrinsic dimension. It's the same story when we switched to CIFAR, when we switched to other tasks. You can also do RL with it, that's the interesting part. You can finally compare RL tasks with computer vision tasks, which people never really do, because people doing RL sort of know that, I don't know, Pong is harder than some other games, I don't really work on that, I don't know. And people doing vision know that MNIST is easy and CIFAR is harder, sorry, much harder, but then they don't make this part of the comparison. But now we can. Of course, it's not a very strict comparison because they are using different networks, but we found that some games are much easier than you'd think. There's the CartPole game, which has a dimension of only four, probably because you just need to move in four dimensions. Even then, what are the inputs into the CartPole game? There are not that many inputs, right? It's just the angle, or the pixels? Oh, I see, from the pixels you can see the- Yeah, they're actual pixels. Oh, it's so cool. Yeah, it's an old paper from three years ago, two and a half years ago. A decade ago, it sounds like, the way you talk about it, how much time has passed. What else is out there? Oh, practical applications. Could people use this to maybe take the networks and deploy them on mobile phones and other applications like that? Yeah, it's a very interesting question. Back then, what we were doing, we sort of claimed in the paper, was a scientific investigation, but there are some implications for compression, because the whole matrix is randomized. So you can just save one random seed to regenerate that whole matrix. And then you train in such a small dimension, so the whole memory usage is much slower, sorry, much smaller. But actually, speaking of this year's NeurIPS, which is coming up next week, there's a paper published there that actually took the idea that we had two years ago, a long time ago, and they actually made it more useful with their method. They added a few tricks to the algorithm. It's no longer measuring this intrinsic property of a network anymore, but it becomes a better training method, so they're able to train in such a subspace and get better networks, or train faster, or with all that memory saved. Check out that paper at this year's NeurIPS. Okay, cool. We should put both of these papers in the show notes. Random subspace training, something like that. And I guess you're also doing something at NeurIPS this year on open collaboration. Is that right? Could you say a little bit about what you're trying to do there? Yeah, that's the whole thing with ML Collective, right? ML Collective is about one thing, it's about open collaboration. We want people to think that science can be associated with employment.
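To make the random subspace idea above concrete, here is a hedged sketch, not the paper's implementation: the tiny model, random stand-in data, the subspace size d=750, and the optimizer settings are all assumptions for illustration, and torch.func.functional_call requires PyTorch 2.x. All the parameters are flattened into one long vector, a fixed random matrix maps a small trainable vector back into that full space, and only the small vector is trained. Measuring the intrinsic dimension then amounts to sweeping d until performance reaches the chosen fraction (say 90%) of the full network's.

```python
import math
import torch
import torch.nn as nn
from torch.func import functional_call   # PyTorch 2.x

torch.manual_seed(0)
X, y = torch.randn(512, 784), torch.randint(0, 10, (512,))   # stand-in for MNIST

base = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
names = [n for n, _ in base.named_parameters()]
shapes = [p.shape for _, p in base.named_parameters()]
theta0 = torch.cat([p.detach().flatten() for p in base.parameters()])  # frozen init
D = theta0.numel()        # full parameter count
d = 750                   # candidate subspace size (the paper sweeps this)

P = torch.randn(D, d) / math.sqrt(d)      # fixed random projection, never trained
z = torch.zeros(d, requires_grad=True)    # the only trainable vector
opt = torch.optim.Adam([z], lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def as_param_dict(theta):
    # Slice the flat vector back into named tensors of the right shapes.
    out, i = {}, 0
    for name, shape in zip(names, shapes):
        n = math.prod(shape)
        out[name] = theta[i:i + n].view(shape)
        i += n
    return out

for step in range(200):
    theta = theta0 + P @ z                              # map d dims back to D dims
    loss = loss_fn(functional_call(base, as_param_dict(theta), (X,)), y)
    opt.zero_grad()
    loss.backward()                                     # gradients flow only into z
    opt.step()

with torch.no_grad():
    preds = functional_call(base, as_param_dict(theta0 + P @ z), (X,)).argmax(1)
    acc = (preds == y).float().mean().item()
print(f'subspace d={d}, training accuracy {acc:.2f}')
```

With a sweep over d, the smallest d that still reaches the chosen fraction of the full network's performance is the measured intrinsic dimension; storing just the random seed for P plus the d numbers in z is what gives the compression angle mentioned above.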
You join a job, you do science, but science can also just be your thing, your gig. You're an artist, you can join a studio to become a senior artists in that studio, but you can also just do arts on your site. And science is at the same time, a collective effort. You need collaborators, you need people to work together with you. For that to happen, if you're taking science as a gig, then you have to be able to work with other people. But then we don't have really a culture there yet. Sort of if you analyze all the papers out there, Google people working with Google people, CMU people are working with CMU people. Not exactly, but there are clusters, right? And there are each of us sort of like bears our own comfort zone of collaborators. We sort of rarely go out of the comfort circle, because it's... With any new people, it's like there's a friction of working together. We don't do that too often, and that creates a problem because we were so little purpose isolated, and new people find it really hard to join all those circles. At least as a new people back then, I find it really hard to just find someone and become their collaborator, because there's not a culture like that. The whole thing with ML Collective is that, we have members coming from all different kinds of employers. They work elsewhere, but they're willing to share their work within ML Collective, they feel this is a safe space that you can share your work, get feedback, maybe become co-workers with people you never would have, because you work in different teams, different institutions. So that's the whole idea. There're many people who sort of are carrying this culture around, that's why we invited all those great people to the social, talking about how they have done that. People that are holding office hours, actively outreaching to people, trying to mentor people on their spare time. There are companies that entirely run science in open way, they broadcast other meetings, they put everything out there on GitHub, even the date. Every day commit of course out there. There are many open cultures out there, so we want to gather people sort of to discuss, the pros and cons of this method. Of course science is much slower produced this way, because you have to uncomfortably work with people that's not familiar to you, but they really improves the overall well-being of the society. It might be faster if you make more connections in the sort of global brain, I could imagine that it leads to more interesting science. I don't know, but then there must be a reason that people are not doing that a lot. I feel like it must be slower in the case that there's always this friction period that you are getting to know each other, what each other's work style is like. I don't know, I feel like people have tried that or it all must be out of this fear of how hard it would be to work with other people. Well, I would think people might just feel shy, it's hard to go meet a stranger. Do you do anything to sort of facilitate, just getting people talking to each other? Yeah, for now we start this organization where people just join from anywhere. I'm sort of like the hub, I know everyone, but they don't know each other to start with, but then if you do biweekly meetings like we do, we talk about research every time. And then you two can be commenting on the same graph and then you realize, ""Oh, we're thinking the same way."" You're like-minded people, we should talk more. Than they can talk offline, then I'm done. 
I'm like a matchmaker sort of link people together. I'm very happy if like two people that didn't know each other now work together. I feel like my satisfaction comes from that. It's fascinating to me that you're taking the credentialing aspect out of these research labs and almost replacing them with collaboration. Is that the reward function or is there a definable function now? Yeah, the reward function for me is definitely just... If I can reduce the whole thing, ML Collective does to one metric, that will be the number of new collaborations formed. That will be my reward function. That's reward function of MLC, but my personal reward function, as I said, is how many papers that have my name in the acknowledgement. That will be my near term reward function. I'll be very happy if people thank me in their acknowledgement. That's great. What about for the researchers in MLC? What's their reward function, do you think? And how is that different from those people who are working at traditional labs? Curiosity driven, that's one. We're not goal-driven, we're not trying to beat any benchmarks. We're allowed to do that, I mean, other labs probably also have some elements of that, but I don't know, that each lab has its different cultures. Some are more open, some are more goal-driven trying to make sure... The whole thing that people about on Twitter is like, I have 28 papers in this conference, so we would not be saying that, because we can never reach there. But also that's not our goal. It's not to get this number of many papers in a conference, it's more like we can have the scientific discovery purely driven by curiosity, like the intrinsic dimension paper. It was just us thinking, hmm, everyone trains neural network this way, was a big factor. Can we train with a small vector? It's like, no reason why we have to do that, but we just thought about it and within must be, right? Because thinking if you can draw a line, there is a dimension one out there, but how hard is it to find that dimension? How hard is it to find that even with the random initialization? I would, if I were to control it, but again, I don't control with the directions of research in ML Collective, but if I were to control it, I would encourage everyone to see it as a fun thing. ML researchers thesis, they're so miserable. I mean, I was part of them, so I know that, every day they're like, ""Oh, this conference is coming up, I'm not submitting, I feel so bad, I'm such a failure."" Really, I just want to make this a fun thing, a gig they're doing, they get to meet new people, they get to work with people different from them, better than them in some ways. They get to feel like they're helpful in other's projects. I think it might be eyeopening for people listening to this, that someone as successful and credentialed as you could feel like a failure. I feel like it's been occupational hazard of the field, but I really do think most people listening, or watching this will be surprised to know that. Oh, exactly so much. I didn't realize, it because it's not so miserable that I'm just crying every day, but it's this like a mild level of depression, which is the worst. Because every time you're confronted, you're like, ""I feel bad, but should I be feeling bad? I'm having this amazing job, and I get to do science in the industry getting paid reasonably well."" You sort of counter yourself of the bad feeling, that makes things worse. 
From the outside, everything's glamorous, I get to publish every now and then, but yeah, I was miserable. And I realized one key thing that changed my mindset: I was viewing everyone outside of my team as competitors, and I was just miserable because I felt I had to compete with them. And if they're publishing 28 papers and I'm publishing zero, I'm losing. But now, by running MLC, I see them as collaborators, or potential collaborators. People are out there, and if we have the same ideas at the same time, the past me would be like, ""No, I'm scooped,"" but now I'd be like, ""Great, that means that's a great idea."" You can be my potential collaborator, I can talk to you, and you can join MLC, and help me, and help others. It's really a mindset change, at least for me. Or maybe it's just because I'm not getting paid right now. If you let people do something and then not pay them, they start to think this thing must be noble, because I'm doing it and I'm not getting paid. I don't know which aspect is the one that changed my mindset from that to this. But yeah, there are many things that have changed. I have to say I really admire you creating the world that you want to see. I think that's super admirable and impressive. Thank you. We always end with two questions, and I want to make sure we have time to do that. The sort of second-to-last question that we always ask is, what is something in the ML field that you feel doesn't get as much attention as it should? That's a good question. I would say understanding of things. I think the field of ML research publishing would become healthier if we started to see a wave of papers that just go, ""I took this little concept, batch norm or dropout, and I studied it so extensively that I wrote an eight-page paper out of it. I tried everything I could with it, without it, in this network, in that network. And the end result is, we didn't find anything amazing, but we understand this concept 1% more."" Because that's science. I want to see a wave of papers written this way, instead of ""we beat this, by this much more,"" because that's very rare. Just trying to go for a deeper understanding of one small concept, say, like, why it helps and... There are so many things we don't understand in the way that we train neural networks. And of course, when you say understanding, people have different comfort levels in terms of understanding. I can see there are people out there having more of a hacker's attitude. They would think they understand something if they watch a five-minute video of it, right? There are more humble, conservative attitudes. I would say more of my scientist peers have that, like: unless I've published a lead-author paper on this subject, I can't say I understand it. Even if I publish one, I still might not understand it. There are different levels of it, but I hope people go for a better understanding of things rather than benchmarks. I guess the last question is, what's the biggest challenge of publishing a paper independently, when you're not living in a big lab? There's so much of it: the lack of resources, the lack of support, the lack of people just telling you this is a good idea or a bad idea. A lack of discriminators, right? When you're publishing a paper, you are the generator of the paper. So secondly, a lack of discriminators. Think about GAN training: without a discriminator, you really can't train a good GAN. All those things.
That's why we want to recreate this great graduate school lab experience for everyone. You don't have to join a graduate school lab, you don't have to join a big industry lab to have the same experience like mentors or collaborators, peers. People just say, ""Awesome on your plans, or you should add one more line to that plot to make it more awesome."" Stuff like that, right? People you can bounce your ideas off of yeah. All that little things is... Of course, we know how hard it is for individual researchers to thrive over there out there. If ML Collective can help them a little bit, I'll be very happy. I think I'll sneak in one final question. If people are listening to ML Collective and feeling inspired, what's like a next step for them to get a little more involved or learn a little bit more? For a nonprofit, where we really want to get this idea out is, there's social impact we want to put out to send the idea for us is the open collaboration. For people out there if you're a researcher, if you're an individual researcher, an independent researcher, you can always come to work with us. There's many collaborators that here will be happy to work with you. If you're already a senior researcher, or established researcher, you should think of this concept actively. Every day, every paper you think about, did I help someone with this paper? Did I just work with the same crew of collaborators that I always worked with the past 10 years? Or did I put someone new on this paper and really helped their career? Because having a paper helps so much in someone's career, at least for now. Did I try to make the world better with this paper aside from the scientific pursuit? Of course, you are making the world better by just putting a scientific work out there. But did I give other people chances to work in science? Did I help someone underrepresented, or help someone from a non-traditional background get into science through this paper? I want to get people to think about this question actively. Awesome. Well, thank you so much. Real pleasure to talk to you, thanks for taking the time. That was a lot of fun. Thank you so much, Lavanya and Lukas. Thanks for listening to another episode of Gradient Dissent, doing these interviews are a lot of fun, and it's especially fun for me when I can actually hear from the people that are listening to these episodes. If you wouldn't mind leaving a comment, and telling me what you think, or starting a conversation, that would make me inspired to do more of these episodes. And also if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",9205 +"Peter Wang — Anaconda, Python, and Scientific Computing",https://www.youtube.com/watch?v=ZvMYCj-B_nA,3011,2021-01-21,"We have to be faced with this concept that technology is not value neutral. If you think about what machine learning really is, it is the application of massive amounts of compute, rent a supercomputer in the cloud, kind of massive amounts of compute to massive amounts of data that's even deeper and creepier than ever before because there's sensors everywhere to achieve business ends and to optimize business outcomes. We know just how good business are at capturing and self-regulating about externalities to their business outcomes. Just as a human looking at this, I would say, ""Wow. 
I've got a chance to actually speak to this practitioner crowd about if you're doing your job well, you'll be forced to drive a lot of conversations about ethics and the practice of your thing about what you're doing within your business as it goes through this data transformation."" You should be ready for that. Steel yourself for that. Don't punt. Don't punt on it. We can't afford the punt. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Peter Wang is the co-founder, CEO, and creator of Anaconda. He's been developing commercial scientific computing and visualization software for over 15 years. He created the PyData community, conferences and devotes his time and energy to growing the Python data science community and teaching Python at conferences around the world. Couldn't be more excited to talk to him. Maybe, for starters, I guess, I know of you because of your product Conda which I've used for many years, I have a feeling most of the people listening to this will know what Conda is, but maybe could you describe it in your words just to make sure we're all on the same page here? Yeah. Conda is a package manager that we built as part of the overall Anaconda Python distribution. It started as a way just to get people package updates of the binary builds that we do. It's expanded to then manage virtual environments, so we could have different versions of libraries and different, in fact, versions of Python in user land on any of the platforms we support. Then, we also created a space for people to upload their own packages. That's Conda. That's the anaconda.org service. Around that, a community has grown up called Conda-forge where they make recipes, maintain recipes, upload packages. But lots of other people like the Nvidia folks or like PyTorch, they will upload their own official builds into the anaconda.org system. We run all of that infrastructure. We pay the bills for the CDN and for all the storage and everything. Then, we do have a community around the Conda package manager itself, so people making tools and extensions for it. That's in a nutshell what Conda is. You think of it as like an RPM or something like that, but primarily for data science and numerical-oriented computing. What's your original background? Were you always making software running a successful software company? No. I've always been programming pretty much since I was like, I think, eight years old. I've been programming something. But I ended up going to college for physics. I graduated with a degree of physics. I decided to join kind of the dot-com, kind of boom by going and joining a startup. I've been in software ever since then. But I spent a number of years working in consulting using the scientific Python and the Python stack in the 2000s. That's really where I started seeing the possibilities for using Python for a broader set of data analysis use cases than just niche scientific and engineering computing kinds of use cases. Cool. Can you explain to me what was going on that you started this project and what the original conception was when it began? Yeah. Sure. Well, the original conception, so the company was called Continuum Analytics. I started that with my co-founder, Travis Oliphant, who is the creator of NumPy and one of the co-founders of SciPy. We put the company together to promote the use of Python and to advance the state of the art for Python for a broader set of data analysis needs. 
That was the original vision. At that time, this was 2012 we formed the company, Wes McKinney had just really started pushing pandas as a data frame library. The Jupyter Notebook was relatively new at the time it was still called the IPython Notebook. The world was sort of a wash in Hadoop big data craze. What we could see was that once people through all their data Hadoop they wanted to do bigger analyses. They want to do broader more kind of cross data set, cross schema sort of analyses. They would need tools like Python. SQL wasn’t going to do it for them. We were putting this stuff together. We were trying to find alternative MapReduce frameworks that were nicer to Python than Hadoop. The rest are kind of the Java, the Apache Java JVM big data stack, if you will. The JVM world does not play with the Python C++ native world very well. Any case, as we're looking at doing all this stuff, it became clear to me that if people couldn't install SciPy and Matplotlib and IPython, they were not going to be able to install any new fangled compiler tools we built or any newfangled MapReduce framework. It was just going to be completely off the table. We started by saying, ""Well, you know what? We should probably produce a special collection of packages, the distribution of Python, that helps people get started that includes all of the basic things they need, that works on Mac, Windows, Linux. That was the basic idea. We built Anaconda. I came up with a name because it's Python for big data. It's a big snake kind of. Nice. Although, of course, I don't like snakes that much. Python is of course named after Monty Python, but whatever, we'll ignore that. That's where the name Anaconda came from for that product. Then, that just took off quite well. We eventually renamed the company Continuum to Anaconda because we'd be at conferences. They'd say, ""Where are you from or what company are you with?"" We'll say, ""We're with continuum,"" and said, ""Okay. Yeah. That's nice."" We say, ""Well, we make this thing called Anaconda."" They say, ""Oh, we use Anaconda. We love Anaconda."" After that happens like the thousandth time, you figure out the world's telling you something so anyway. But anyway, that's the journey. Since then, we've continued to push new open source tools and things like that in the Python called data stack. It's incredible, the impact, that I think you've had and certainly NumPy and SciPy in terms of just making Python a popular product. Do you ever regret choosing Python for this? Has that been a good choice for you? Oh, no, no. That was completely intentional. A thing that people should understand, I think, especially as more software engineers move into ML and become ML engineers, for them, language is just a choice. It's like, ""Well, I'm a C++ coder now. I learned some Go. Now, I'm doing Python. It's like whatever. Python's got some warts, and it's got some good things. But the thing to recognize is that Travis and I when we started this, the reason why we wanted to push Python was because of the democratization and the access, the accessibility of it, when you're a software developer, you learn new languages all the time because that's part of your gig. If you're not a software developer, if you're a subject matter expert or a domain expert in some other field, let's say, you're geneticist or let's say you're a policy maker or whoever, you're astrophysicist, learning a new software programming language is hard. You're not really a coder anyway. 
You had to learn some Fortran or C++ or Matlab in grad school, but otherwise, you're not doing this on a weekend just because you love it. If you learn a language, this is going to stick with you for a while. If we as people who make languages or make software tools, if we can find a language that people like to use and it's powerful for them and that multiple different kinds of people can use, that's incredibly powerful. One of the things about Python is that the creator of Python, Guido, before Python, he was working on a project called Computer Programming for Everyone. Some of the ideas that went to Python came from that precursor language called ABC. That readability counts in that of executable pseudo code thing, the same things that make Python hard to optimize, to make it consternation for statically typed language aficionados, those things also make it incredibly accessible to lots of people. When we make these kinds of advanced tools available accessible to lots of people, what we do is we grow the universe of possible innovations. For me, it's very intentional that we chose Python. There's thousand new languages you could create that are better than Python and all these different dimensions. But at the end of the day, Python is kind of the language everyone uses. It is valuable that everyone uses that same language. I have a very, very strong opinion about the fact that we should continue promoting its use and growing its use even as I fundamentally believe there must be a better language out there. That's like the successor to it. I have some ideas about that as well. Oh, interesting. I'd love to hear about that because we were talking with one of the fast.ai founders, Jeremy Howard. He's written so much Python code. He was really emphatic when I was talking to him on the same podcast about Python can't possibly be the future of scientific computing. I was surprised. I would say my perspective is definitely a non-expert, but I really enjoy programming in Python. Maybe, it's hard for me to really see how things could be better or maybe I don't have to worry about performance as much as other people. But what would your take be like? Is there any kind of language of less adoption that you think is really intriguing and could replace Python or are there tweaks to Python that you'd like to see? How do you think about that? Yes and no. There are languages out there that do interesting things that are things that Python can't quite do or that Python may never be able to do. One of the fastest database systems out there is the thing called KDB and the language in it K, you're not going to find any... I mean it comes from like the APL roots which are the precursors to the Fortran stuff. Then, Matlab and NumPy and all these things. In any kind of ALGOL and modulative derived imperative programming language, you're not going to meet the kinds of raw numerical performance that K and KDB can achieve. The creator of K in KDB has a new thing that he's building called Shakti which is even more interesting. There's that kind of lineage of things. They're like the most out there amazing bits of Lisp plus like Fortran, and you get something like that. Python is not there, but Python has a lot of the good parts of the ideas there. It expresses them in a infix imperative language. Then, there's things like Julia that do hopefully- Sorry. Sorry. Let me try to understood what you said about K and other ones. What's the advantage of that they have the potential to be faster? 
It's more than just faster. It's a fast and correct and performant representation of your idea, but you have to sort of warp your brain a little bit to thinking in that way. Ken Iverson, the creator of APL which is the root of all of this stuff, he had this idea that notation is a tool of thought. If you really want to help people think better and faster and more correct all the same time, you need better notations. If you ever go and look at a bit of K, it looks different, et's put it that way. Then, what you are mostly used to in a Python or even a C++ or C or Java world, it's completely different. It comes with a different brain space. Interesting, is that just because it's sort of following different conventions or is there something to this perspective? Because I feel like every so often not in many years, but in grad school, I used to occasionally run across 4chan. It would just be like, ""Okay. I'm stopping here. I'm not going to go any deeper, this just feels impenetrable to me, but is that my fault or is that like... Yeah. Is there something there that's like better about it, I guess, in the notation? Well, better is a big word. I'll back up and say the difference between something like K or fourth or J kind of like JK fourth, APL versus ALGOL or like Pascal C kind of this lineage of fairly imperative procedural languages. At the end of the day, we are programming. When we write a program, we're making a balance of three things. There's the expression itself like what it is we're trying to express. There's the data, the representation of the data. Then, there's like some compute system that's able to compute on that data. I call this kind of the iron triangle of programming, is that you've got expressions and expressitivity or expressiveness. You have data, schemas, data correctness, things like that. Then, you've got the compute which is run time, again, correctness, runtime characteristics. Every programming system sits somewhere in the middle of this like ternary chart. Usually, you trade off. What happens is you usually collapse one axis onto the other, and you have a linear trade-off. Most of the post-Nicholas worth era of looking at, okay, you've got data, you've got a virtual machine and you're going to basically load data in and do things to it with functions that proceed like this. That model sort of everyone has in their heads as a programming system. When you look at something like fourth or like K, you actually come from a different perspective. Fourth, I'll throw that in there because even when you do have an explicit data representation in mind, when you write programs in fourth or if you ever had an HP calculator reverse polish notation, probably the closest, that most people will ever get to fourth, you're explicitly manipulating stacks. You're explicitly manipulating these things. You're writing tiny programs that can do a lot. It's amazing. That's what an explicit stack and explicit, these kinds of things. When you go to something like Lisp or like K, you're writing these conceptual things, these expressions. Well, in the case of Lisp it's a conceptual algorithm. In the case of K, it's also an algorithm, but it's an algorithm on parallelizable data structures on arrays and on vectors. Then, part of your first class thing that you can do is you can change the structure of those data structures. You can do fold operators. You're going to apply in these ways. You can broadcast and collapse and project. 
All of those are first class little things you can do in line as you're trying to express something else. You end up with a line of K that's this long, that would take a page of Java to do. By the way, the genius of the K system is that the underlying machine that interprets that, the compiler and then the interpreter system is incredibly mathematically elegant because there's actually fundamental algebra that you can sit in the heart of this stuff that you can then... Basically, K will load into... I think the claim is that it loads into L1 I-cache. Your program just streams to the CPU like a mofo. You're never even hitting L2. That's kind of an amazing thing. I think when you turn around, you look at something like Python which is not that optimized at all, it's like the C-based virtual machine, but when we do NumPy things, you're expressing some of those same ideas. Yeah. I was going to say this reminds me of my experience with NumPy where I keep making it tighter and tighter and shorter and shorter and more and more elegant. But then, when I need to debug it, I feel like I often end up just unpacking the whole thing again. I don't know if that's like me being stupid. Well, it depends on what you're debugging though because you can make it compact. Then, when you debug it, it's like, ""Are you debugging an actual bug in the runtime of NumPy itself?"" Are you debugging a performance mismatch with your expectation relative to how the data structure is laid out in memory? Are you debugging a impedance mismatch between your understanding of what NumPy is going to do in each of these steps versus what it's... There's a lot of things to debug, so to speak, but that's one of the downsides of making really tight NumPy snippets because I did some of that back in the day. I was like, ""Oh, this is so great."" Then, something blows up, but it's like, ""Oh, crap."" But wait. I'm like taking off in all these tangents. I'm actually really fascinated. This is a conversation. Totally. You were saying, so you're comparing to K which actually Jeremy Howard did talk about and really, really praised. Great. But then, what are the other kind of languages that have like interesting pieces that could be useful for scientific computing. Yes. Jim Gray, the late great Jim Gray wrote an amazing paper back in 2005 called Scientific Computing in the Coming Decade. It was prescient. It was ahead of its time, I think. It was at Jim's time. He knew it, but he was writing this great paper. It talked about how so many different things he talks about in this paper. It's worth everyone to read it. But he talked about how we would need to have computational sort of notebooks, how we need to have metadata indices over large data that would have to live in data centers that we couldn't move anymore. We'd have do computing. We have to move ideas to code... Oh, sorry, move code to data, move ideas to data, all these different things. But one of the things he explores is why don't scientists use databases? Databases is the realm of business apps and Oracle nerds. Why don't geneticists and astrophysicists use databases? The closest they get is using HDF5 which is really just like it's a file system. Great. It's a tarball. It's talked about lays out a memory, so you can compute on it. That's great. You can do out of core execution on it, but why don't scientists use databases more. 
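To ground the array-programming discussion above, here is a small NumPy example with made-up data (the array shape and the particular statistic are arbitrary): one vectorized expression that standardizes columns and reduces across rows, next to the explicit Python loops it replaces.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))          # made-up data: 1000 rows, 64 columns

# One vectorized expression: z-score every column, then average the squares per row.
z = (x - x.mean(axis=0)) / x.std(axis=0)
row_energy = (z ** 2).mean(axis=1)

# The same computation written as explicit loops.
row_energy_loop = np.empty(len(x))
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)
for i in range(x.shape[0]):
    total = 0.0
    for j in range(x.shape[1]):
        total += ((x[i, j] - col_mean[j]) / col_std[j]) ** 2
    row_energy_loop[i] = total / x.shape[1]

assert np.allclose(row_energy, row_energy_loop)
```

The vectorized version is shorter and much faster, but as the conversation notes, when it breaks you often end up unpacking it back into something like the loop to see which step's shapes or semantics didn't match your expectation.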
He looked at this a little bit more, but one of the things I think that would really move scientific computing forward is to treat the data side of the problem as being more than just fast arrays. Actually, as we have more and more sensor systems that have more and more computational machinery to get to additional data sets which then become transformed into additional data sets, that entire data provenance pipeline even as businesses have to reinvent the enterprise data warehouse to do machine learning on all their business data, I think scientific computing has to honestly sit down and face this giant problem it's trying to ignore for a very long time which is how do we actually make sense of our data, not just some like /home/some grad student's name/temp/project five/whatever. We've got to actually do this for reals. I think one of the ways to move scientific computing forward that is on the completely opposite side of going to the K land and fast APL land is treating the metadata problem and the data catalog problem and, in fact, the schema semantics problem as a first class problem for scientific computing. If you look at what F# did with Type Providers and building a nice extensible catalog of schema that was actually part of your coding as you were using data sets. They did that 10 years ago. That stuff is amazing. That is something that we should make available. That's something that would be a game changer. I don't know if you saw this thing where some like internet council of like geneticists, they declared they would change actually gene names. Did you hear about this? No. There were gene names they changed from March 1 Sep 1, things like that because [inaudible 00:19:57] use Excel so much. When those show up in Excel data, Excel translates them into dates. It screws them up. Because of Excel's auto formatting of their things, they're literally changing the names of these genes. This is how depraved science has gotten, is that we will... Not that those are necessarily great names to start with, but the fact that we will wrap ourselves around a fairly broken tool for this purpose. I don't know. For me, handling the data and schema problem for science like full stop, that's a huge part of the problem that needs to be done. Yeah. That's so interesting. Is that something you're working on? No. I have a company to run. Like we have to make money! No. When we get to a certain point where we have the resources to invest in additional projects, then this is one of the ones I would absolutely try to tackle. We do have a project that's in this vein. It's called Intake. It's not the sexiest sounding thing in the world, but Intake is a virtual data catalog that lets you set up a data server. If you set up Intake server over here near your data and you fire up the client just in your terminal, on your notebook or whatever, you can connect to it, and you hit it with like remote Pandas and Dash calls and things like that. You can also create transformed almost like materialized views of those things on the server. It's been used in a few projects. Some people are starting to pick it up, but it's something I would recommend people check out. It's called Intake. Cool. All right. We'll put a link to it. Can you give me some examples of who your customers are. This is like such business speak. What's the value? Yeah. We have a couple of different things that we sell. For a while now, we've been selling enterprise machine learning platform called Anaconda Enterprise. 
It's based on Kubernetes, and data scientists can... It can stand it up. Data scientists log into it. They have a managed governed notebook environment, well, any number of different UIs, but generally, people prefer notebook environments. Then, they have one click deploy for dashboards, for notebooks and things like that. They can run machine learning models and have rest endpoints they deploy. It's a, yeah, a big data science platform thing. There's another thing we sell that is just the package server. A lot of the value that businesses get from us is that they have actual vendor backed this place to get binaries to run in their governed environments which actually does matter to them. In that situation, what they want to do is they have like a server. They buy a server from us that has the packages. Then, it's proxied locally for them. We don't get to see all the packages that they're downloading what they're doing with their data analysis. They also have faster access to all of these different packages. They're IT people. This is a really important thing. IT has a chance then to also govern which clusters, which machines, and which environments can use which versions of which libraries which is a really important thing because in an enterprise environment, you have data scientists who want the latest, the greatest and bleeding edge, everything. Then, you've got production machines which you do not want getting the latest and greatest everything. You want to know exactly which version, how many CVEs, which ones are patched. That's all that runs in production. This is a package server that gives business the ability to do that. Those are primarily our two commercial products. We'll be coming up with some more things later in the year. It's an individual commercial edition that individual practitioners can buy, things like that. You've been doing this a while like at least a decade. No. Not a decade. An octal decade. We started in 2012. Nice. Even that is quite a long time, I think, for this space. I'm curious when you started, what kinds of customers or what industries are using you the most and how has that changed over the last eight years? Yeah. When we started, it was very heavily in the finance, so hedge funds, investment banks, things like that. There was a heavy use of Python there at the time. We were doing a lot of consulting and training, open source consulting, standard sort of things like that. Nowadays, you see a lot of these venture-backed open source companies that have a product. It's like, ""Here's our open source foobar. Here's the enterprise foobar++."" Then, Amazon build a clone of it off their open source. They go public anyway, make tons of money. This is a play that many companies have done especially around some of the big data infrastructure projects. It's pretty popular move. We are an open source company that supports an ecosystem of innovation. There's a lot of things that are out there that we deliver and ship via Anaconda that we ourselves don't write. That innovation space has changed. It's gotten sucked into so many different things. Now, we've seen everybody, insurance, oil, and gas, logistics, DoD and three-letter agencies and just like everybody is using Python to do data analysis and machine learning. It's just literally everywhere, like sports betting sites, Netflix, and the Ubers of the world like everybody is doing this stuff. 
Now, not all of them are paying us yet as paying customers, but that diversification of well, I would say diversification, but that growth in adoption was we were hoping what we were hoping to unleash when we started the company. It's been really great to see all that happening. We couldn't have predicted deep learning. We couldn't have predicted that machine learning was the thing to take off. We were really thinking that it would be more rapid dashboards around notebooks, around building. Here's the data analysis. I'm a subject matter expert because I can write a little bit of Python code. I now can produce a much more meaningful, rich, interactive dashboard and control pane for my business processes or for my whatever like heavy industrial machinery. We saw that happening pretty well in the 2000s around a rich client toolset as sort of a Matlab displacer. But now, with machine learning on the rise, it's completely flipped Python usage into a different mode. That's, as you would know at Weights and Biases, like that's the dominant conversation on Python, but these other use cases are still there. There's still a lot of people using Python for all these engineering simulation things. Anyway, it's just been great to see all this growth and diversification of use cases. Is machine learning even the top use case that you see? It feels like the buzziest right now, but I always wonder what's the reality of the usage volumes versus what you see on the ground. It's the aspiration that people get paid for that way. I think there's a strong disconnect between older businesses. I would say Python has crossed the chasm. You talk about the chasm of technology and crossing the chasm. Python has crossed the chasm. On the other side of the chasm, the way that this kind of innovative technology has landed is that you have a lot of buyers who are not as sophisticated about what is they really want to buy or what it is they're buying or how ready they are as a business to adopt what they've bought. You can buy the fanciest like Ferrari, but if you have a dirt track road, it's not going to go as fast as you have like an actual smooth paved road. A lot of businesses have this problem where they can buy the hottest sweetest ML gear team tooling, blah, blah, blah, but then their internal data is just a mishmash. You spend 80% of your time digging that ML team out of the data swamp. That message, I think, people are starting to get it now as they come over into the chasm of the trough of, what, not despair, something- Disillusionment. Disillusionment. That one. Right. But the truth is this, there's an ML hierarchy of needs just like Maslow's. If you don't have your data stuff together, if you don't understand the domain problem you're trying to solve, you have no business even doing data science on it. If you haven't done data science, there's no models to go and optimize with machine learning. But if you get all that in place, then machine learning can absolutely deliver on the promise. I think people try to buy the promise, but most of the people they pay are out there slugging a bunch of trying to basically denormalize data, dedupe data, and just do a lot of that kind of stuff. Most of the most of the verticals that you mentioned, I think, are not the first things that come to mind here in Silicon Valley for ML applications, but you actually see like insurance doing ML and thinking of it as ML, just as a specific example. Oh, absolutely. 
The hardcore finance folks are probably the only people, I would say, that lead Silicon Valley in terms of ML. The hedge funds were there first because they operate in a pure data environment. The thing about that data environment is everyone else is operating in the same pure data environment. By the way, it's all zero sum. If you screw up by a millisecond, you lose millions of dollars. Incredible incredibly hard odds or hard boundary conditions to be optimizing in. I think Silicon Valley, it's a lot of consumer behavior. It's a lot of like this kind of thing. Certainly anything in ad tech and the attention economy, the ML there is fairly low stakes. Of course, hundreds of billions of dollars of corporate valuation hang in the balance, but if you screw a little bit of something up, it's like, ""Well, they'll be back tomorrow doomscrolling."" We'll give them some better content tomorrow, but when you're in insurance and these other things, the ML, those models, the kinds of diligence that a credit card company has about honest models and model integrity, the kinds of actuarial work that goes into building models at an insurance company, that's real. There's real hard uncertainty. If you screw up, that's a $100 million screw up. There's real stuff happening there. There are no light weights on this stuff. They're doing real things. Cool. I guess when I've talked to insurance companies, it's felt like there's almost these two separate teams that feel a little bit at odds with each other. They're like the old-school math, guys, like the actuaries who are like, ""What is this? We've been doing ML forever. This is just a re-branding of the stuff we've always been doing."" Then a couple guys off to the side maybe doing some crazy deep learning projects that you wonder how connected they are to the business. Do you feel that same dynamic or - Oh yeah. Absolutely. Any organization over like 50 people is a complex beast. Even 50 people can be pretty complex. These larger firms, there is definitely a struggle internally as they do this data transformation into the cybernetic era is what I've been calling it, the cybernetic era. Many of them, the theory of action, is still open. It's like, ""Whoa, we sell this particular insurance policy. We'll see what comes back five years from now."" We'll look at a five-year retroactive performance, and then, we'll know if the model is correct. Those kind of old guard folks who are... Yeah. A bunch of actuaries writing a bunch of SAS code, that's some old school stuff. Then, there are new people in that space who have access to the data who have the statistical background and who know they can do way better. There is a conversation happening. Within credit card companies, you'll have... They're a great example because there's regulatory pressure. There's old school models and SAS. There's newer people trying to do some better credit models. There's really cutting-edge people doing real-time risk, real-time fraud, all these kinds of things using deep learning sometimes using all sorts of GPU-based clusters. You just see a whole pile of different things within a credit card company that you might not see it still. In Silicon Valley, it's been more monoculture because there's less tech overburden that they had to dig out from. There's like, ""Well, right there's like, ""Well, we need a bunch of machines in the cloud. You got it because there's no regulators checking into this stuff."" Yeah. What do you make of, I guess, the Matlabs and the SASes of the world? 
Is that ever a sensible choice for someone for their tech stack or is that just a completely legacy software choice? Well, let me see here. I think the best way to answer that is that any time we make a technology choice, we should be very respectful of Conway's law which is that the technology systems that we build, the software systems we build are a reflection of the communication patterns within the teams that built it.",5738 +"Chris Anderson — Robocars, Drones, and WIRED Magazine",https://www.youtube.com/watch?v=naGcNdqOq7k,3807,2021-01-14,"So how do we make it so that more people can engage with self-driving cars without working for Google or Waymo or whatever? And the answer is you take the essence and you reduce it to a unit that anybody can have access to, exactly as we did with drones. I didn't have a Predator, so I made one out of Lego and foam. And I didn't have a self-driving car, so I made one out of toy parts and a Raspberry Pi. And so what you're seeing is this incredible diversity of people who are engaged. You're listening to Gradient Dissent, a show about machine learning in the real world. And I'm your host, Lukas Biewald. I knew Chris, originally, or knew of Chris as the Chief Editor of Wired and the author of The Long Tail, but it turns out that, in the last decade, he's gotten super into drones and started a company around DIY drones and now works on 3D robotics and DIY robot racing. So it's a super interesting conversation and I think you'll enjoy it. Chris, it's such an honor to meet you and you have such an interesting kind of arc to your career. Before we get into the stuff you're doing now, could you kind of tell us about the highlights of what you've done before the robot stuff? Sure. It looks really chaotic and random, but every step made sense at the time and possibly, if I do my job here, I can make it make sense in retrospect. So I was a terrible student, essentially failed out of high school and failed out of college, then played in punk rock bands for most of my 20s, working as a bicycle messenger. Wait, is that right? Yeah. Really? Yeah. The best story is that I was in REM. No. Wait. Really? Wow. Well, there's a little bit of a footnote to that, which is not the REM. No, I was in a band called REM in Washington, DC. We were really good. We were about to release our first album and our manager said, ""It's the weirdest damn thing. There's this other band called REM, but they're releasing their album on the same day, but don't worry. They're from Athens, Georgia. How good could they be?"" And so we thought it'd be really funny that we, sort of the famous, big city REM, would invite the little country REM up to Washington, DC for a battle of the REMs and the winner would get to rename the loser, which they agreed to. And they came up and we played a joint record launch party. Wait. So this is early '80s. Right? Yeah. I guess probably about '83-ish. '83? Maybe '85-ish, something like that. And so we got the famous 930 Club in Washington, DC. So we played and then we flipped a coin to see who goes first. We went first and we played good sets. We got decent applause. We went to the bar to celebrate our inevitable victory. Then they came on second and their first song was Radio for Europe, which was their first single. Yeah. And our jaws were on the floor and our beers unfinished and we realized we were completely sunk. And they were great, as you might imagine. Won, as you might imagine. 
And Mike Mills, the bass player, stayed around just long enough to rename us Ecoslavia because we were so arrogant to think that we would win. And we released our album under that name and the rest is history. Anyway ... Wow. That's amazing. Yeah. So that was my little sort of brush with fame. But yeah, so complete fuck up student. Oh, sorry. We have to beep that. But eventually, in my last 20s, I was like, ""You know, I don't think this punk rock thing is really working out for me. I should probably use my brain again."" And as a teenager, I'd been thrilled by physics and I got the Feynman Lectures on my 16th birthday, which was ... I read those instead of going to school. And so I went back to college and I decided, at this point, I had so much to prove that I was going to do the hardest thing possible, which was physics. And so I got a degree in computational physics, which was a new thing at the time. And the job of all of us physicists, those days, was basically to understand the nature of matter as we go closer and closer to the Big Bang, so higher and higher energies. And that means bigger and bigger and more expensive particle accelerators. And we were all sort of queued up to work on something called the superconducting supercollider, which was going to be in Texas. And the problem is the cost of the collider kind of scales with the energy it produces and went from $8 billion, to $16 billion, to $19 billion and then Congress canceled it. And that was it. There were no more interesting experimental facilities in the United States and it was all going to be queuing up for CERN, the LHC in CERN, Switzerland. And I realized I could see my career. I was going to be an assistant professor at Iowa State, waiting for my experiment to run at CERN. And 20 years later, it would run and it would probably fail and I would be author 300 on a paper about an experimental failure. And I was like, ""That sucks."" And I wasn't even very good at it. So it just was time to move on. So I went to the adjacent space, which was the science journals, Nature and Science, to write about science rather than be a scientist. And then went from there to The Economist to lead their tech coverage. One of the things that we learned from that generation of physicists who just, basically, their careers vaporized with the SSC, was that, although physics was not going to be our future, we had accidentally created the internet, as physicists. So the internet, as you know, was created largely at the Link Research facilities. The web was created at CERN, a physics lab. And we, as physicists, had the only big data out there. We were the only people doing big data because we had all this data coming from the particle accelerators. So we had these skills, big data and internet. And so when this generation vaporized to the winds, most of them went to Wall Street to become quants, which was the next source of big data. And the ones who didn't do that went to sort of create the emerging internet industry, which is kind of what I did as a writer and sort of kind of moved media onto the internet. Before you go further, getting a PhD in quantum physics is no joke. I have a lot of friends who feel stuck in academia and have trouble getting out. Even though the careers available to physicists, for example, are quite good, I think most people feel a little bit like failures because, inside of it, you're so funneled through this escalator to success. 
I mean, you speak so rationally about it, but actually I feel like most people aren't able to make that leap. Do you think it was your perspective of having not jumped from undergrad to grad school or something else? Yeah. I mean, to be clear, I don't have a PhD. I dropped out of the PhD program. Not even at ABD. I didn't even get that far, but if you love physics, it's kind of heartbreaking to see what's happening now. So you're inspired by the greats, but like all scientific disciplines, you need theory and experiment to be matched by a limited amount of time. So a theory comes out and you want the experiment to be able to falsify or not within, say, five years. If that gap grows, then theory becomes unmoored, In reality, and it becomes almost like poetry and, now, it's the coolest theories and the ones that are best told or the ones that spark the imagination. And it's almost like metaphysics. It's no longer physics. It's almost philosophy. And that's a really weird place for a scientific discipline to be. And I think that the people who stick with it and all they can really do, they can either line up for an experimental facility and see you in a generation or they can go into theory. And it's seductive in that it's math, but it's not real. And I think you can really get lost there. It's almost religious. There is a slight ray of hope, though, in cosmology in that, rather than having physics facilities, terrestrial physics facilities, that we use the stars to create energies and to observe So we're getting much, much better at using astrophysics as an experiment, but it can't do everything. So I'm not sure I answered your question, but basically, if you fall in love with physics, what you get is a really good grounding in statistics and math, but it's not a great career and it's probably best to use that grounding. And there's plenty of physicists out there doing good work in machine learning and elsewhere. So that's why my degree was actually computational physics, which in retrospect was more about compute than it was physics. I see. Interesting. But then you sort of left that whole thing to be a journalist. Well, yeah. I mean, again, it was stepwise. My parents were journalists, so it was kind of like the one thing I was sure I was not going to do was journalism, but writing about it for ... but again, Science and Nature are scientific journals. So it wasn't like grubby newspapers. And then The Economist, again, everybody was a PhD of one sort or another. Most people were. So it really felt like you were part of extended academia. Then moving to Wired and to take over Wired in 2001, that was the first leap into traditional media owned by Conde Nast, which owns Vanity Fair and New York and things like that, so traditional media, but they bought Wired. They hadn't created it. And Wired was created largely as an evangelical bible of the emerging internet. And one of the reasons I'd left science was because, in '93 when Wired was launched and the internet was just forming, I wasn't sure what it was. Again, we thought it was just a way to telnet into the cray at Los Alamos. And then this magazine comes out with these dayglow colors saying, ""No, this is a cultural revolution. This is going to change the world. This is going to change everything."" And it just blew my mind. I suddenly realized this thing I was kind of good at actually had these big implications. That dictated the direction of my career. 
And so when the opportunity came to lead it, I was like, ""Yeah, this is the religion I believe in."" Well, that's funny because 2001 is a really interesting year. Right? Was this pre bubble collapse or post bubble collapse? No, post. It was the best and the worst time. So the bubble collapsed in March of 2000. 2000. Yeah. Got it. Right, right. So at that point, most of the world was saying, ""This is a sub-prime mortgage. This is a hoax, perhaps even worse,"" but you had to believe that the internet was not the stock market, that there was something real at the core of the stock market and that the bubble was a finance artifact, but the underlying trends were real. And it was a very unpopular and somewhat minority view, at the time, that the internet was real, but if you were to bet at that time, as I did, that the internet was real and the stock market thing was a stock market thing, then you're buying at the bottom, essentially. So I don't think I would've been offered the opportunity to take over Wired if everybody knew it was the hot place to be and I wouldn't have been able to hire the people I did. And as I say, the best time to take over when you're not particularly experienced with this kind of stuff is at the bottom because you can hire people. Your lack of success is cloaked by the market's lack of success. It's impossible to succeed in that environment, so no one could tell whether your failures are yours or exogenous ones. And then, thirdly, once things start to pick up again, year-on-year growth looks amazing. So it works out really well, but it was a very countercyclical bet, at the time. And if you look back at underlying internet adoption trends, you almost can't see the bubble bursting. It was really isolated to the stock market and all that capital created a huge amount of infrastructure, which we still enjoy today. Interesting. So what about The Long Tail? How did that come about? Yeah. Thank you. When you take a physicist, basically, by heart and you stick him in media ... I'm not trained as a writer or as an editor, didn't have any particular interest in media. What I was really interested in was the story, but I'm a nerd. So what I'm going to do is I'm going to try to do research about the story. And so not trained as a journalist, I was trained as a data analyst. And so I was like, ""Well, something important is going on in the server farms of Amazon and Netflix. We can probably see it as a lens on human behavior in a way that we never have before."" We're basically instrumenting society in a way we never had before. And this is obvious today, but it wasn't at the time. And I said, ""I bet, if I could get that data to see how consumer preference actually looked at scale, I bet it would be interesting. I bet we'd learn things that we weren't seeing with the ..."" I don't know, Department of Commerce reports or the Walmart quarterly earnings. And so I asked. I asked the Yahoos and the Netflixes and the Amazons for their data and, weirdly, they had to sign a couple NDAs and anonymize some stuff, but they gave it to me and I just got these massive datasets. I did really dumb stuff. I just stuck it into a spreadsheet. And you had, basically, sales of a set of products. Take music, for example. You get a million tracks and then you rank them in terms of popularity. And I stuck them in a spreadsheet and nothing showed up. 
The graph was empty and I was like, ""Wait, what happened here?"" And I said, ""Well, let me just cut off the first hundred and just graph from 101 down to a million."" And then I could see the line and I realized what happened is that the inequity of the marketplace, the incredible scale differences between the number one track and the number one million track basically compressed my scale. So the scales are set by the number one. The Y is set by the number one and the X is set by the number million. And so the line was basically just right all along the axis. And until you cut off the head, you couldn't see the tail. And it was simply that dumb thing that I did one night with a spreadsheet that kind of created this, that just shifted my gaze to the right. And I realized there was a lot there that we weren't paying attention to because it was a high number, but low magnitude. And that created the notion of The Long Tail and then got the other datasets and they all confirmed that, if you have basically infinite inventory and mechanisms for people to explore that inventory, that consumer preference shifts down the tail. Not entirely. We still have network effects and hits and things like that, but basically there's a lot of suppressed preference for niche stuff that was suppressed by the scarcity function of shelf space that was opened up by the non-scarcity by the abundance of online databases and ecommerce, et cetera. What's interesting is the caveat to that story is that I got a bunch of datasets and then, about a year later, AOL was also sharing some data and shared with some academics and somebody figured out that you could de-anonymize. Oh, the search data. The search data. Oh, I remember that well. They could de-anonymize the search data. Yeah. It was a shit show. And as a result of that, all the companies stopped sharing data. So there was basically a 12-month period where you could do the work I did. And internally, companies do it all the time, but externally, you can't get the data anymore. I feel like you named this really important phenomenon. Right? I think it's still called this today. It seems like there's a real skill in ... You just nailed something that's so important. So I have to confess that ... Yes, I did kind of come up with that name and it turns out that, actually, that phrase has been used before. People talk about fat tails, et cetera. Fat tails. I heard it. Yeah. Yeah. It probably has been used, but I think I called it ... I think I at least invented it in my own head, but I didn't think it was a big deal. It was slide seven in my presentation. And I went to see Reed Hastings, the CEO of Netflix, and kind of walked through my presentation, walked through my analysis with him because they'd helped. And he got to slide six and he says, ""There's your headline right there."" And so Reed Hastings was the one who actually identified The Long Tail as being the mot juste, if you will, that captured it. Wow. And I guess it's funny. I feel like that's maybe the thing you're best known for. Are you kind of sick of talking about it? No. You know what? I'm not doing the act of research in it, but no. What's interesting is that any sufficiently novel idea will separate the audience into two halves. There's those who say, ""No way,"" and those who say, ""Duh."" And it almost goes generationally. So anybody who grew up on the internet was like, ""Duh. Of course,"" lots of products, lots of choice, lots of niches are a thing. And anybody who kind of grew up before that ... 
And I don't mean to be ageist, but it's kind of cultural age, if not chronological. There's a lot of people who grew up, culturally, in the era of Blockbusters and Top 40 radio and three TV channels, et cetera, who basically argued that the Blockbuster was forever and that The Long Tail was a mirage that probably wrongly gave hope to niche artists that they could somehow work. And of course, they're right. And it was clear. I never said it was the end of the Blockbuster. I said it was the end of the monopoly of the Blockbuster. It was clear that the economic rewards would be felt largely by the aggregators rather than the creators. And the cultural rewards were felt by all of us, of course, and the creators obviously, take music or writing or whatever, there's certainly some psychic rewards of being listened to or read, but the fact that the internet exists doesn't mean that a struggling musician is going to be any less struggling. So I think there's a lot of people who just kind of read it as only, ""Blockbusters are dead. Therefore, The Long Tail is wrong."" And they still say that. And then there's a lot of other people who feel that it's completely self-evident. One of the kind of tragedies is that I wrote the book before YouTube existed. And YouTube, of course, is the canonical long tail marketplace of all cultures and niches, et cetera. And so, on one hand, it's kind of weird. I still have academics who show me people really don't understand the math of The Long Tail and they keep saying percentages. It's like, ""Well, the top 1% of X still has 90% of the ..."" They don't realize that it's 1% of 100 million. In absolute numbers is a lot, but still. And I still get this all the time from academics who, like, ""The Long Tail's a hoax because top 1%,"" cetera. Meanwhile, anybody who ... I should be able to say, ""YouTube. Discuss,"" but for some reason, some people just don't want to see it that way. So I do end up still trying to find evidence of it. Actually, it was a lot less controversial than my next book, Free, which was the economics of free stuff. And obviously, economics is largely focused on monetary economics, and yet there's obviously a non-monetary marketplace out there, as well. I mean, we're doing it right now. You don't charge your listeners for this and I don't charge you for this. We're doing some exchange, some non-monetary exchange that has value, but economists don't know how to measure it. So that one was actually much more controversial. Interesting. What was the controversy? Did you feel like you got a lot of negative feedback? Yeah. I mean, especially from media. I have kind of a love hate relationship with the media, which is increasingly becoming a hate hate relationship, but the newspaper business was imploding and they largely believed that the canonical error that the newspaper business made was putting their content free on the internet. And had they only set up paywalls at the beginning, that somehow media would be preserved. And people in media take themselves pretty seriously. They feel like they're the fourth estate and protectors of democracy and the only people who can keep us from the mob, et cetera. And so they believed that free content on the internet was destroying this foundation of democracy and that I was helping. I was not helping, if you will. Okay. So what happened next? Then you got into drones? Yeah. So running a magazine by day, but I was still a nerd by night. 
So my first nerd thing was The Long Tail and the statistical analysis and writing books. They were largely economic books because, even though I'm not trained in economics, my time at The Economist sort of osmotically gave me some exposure, but still basically I'm a programmer by heart. And as my kids got older ... I've got five kids and my wife's a scientist, as well, and we tried to get them interested in science and technology. As they got older, I was thinking of cool things to do with it and I actually started a site called Geek Dad, which is all about- Oh. You started Geek Dad? I know Geek Dad. That's awesome. Yeah, although I think Geek Mom is actually doing even better right now. Nice. Also a spinoff this. So I started Geek Dad. And largely, the notion was stem projects that were sort of fun for the kid and fun for the adult because there was a lot of things that were fun for the adult and not fun for the kid or fun for the kid and not fun for the adult, but the ones that kind of got it exactly right. So in the course of doing that, I was like, ""Robots. We should probably do something with robots."" And the kids are like ... So Lego, I was on their advisory board and Lego sent me the first Lego Mindstorms. Whoa. Man, wow. That's awesome. It was pretty cool. So they sent me the first Lego Mindstorms, like beta testing. And so I showed it to the kids and the kids were like, ""Yeah, we'll do it."" And so you follow the instructions, you put it together and it takes all morning. And then you have a little wield robot that'll kind of move towards a wall and then back away. And the kids were like, ""Are you fucking kidding me?"" No, sorry. Definitely bleep that. They did not use that kind of language, but internally, whatever the sort of nine-year-old equivalent of that is. And I realized that Hollywood has ruined robotics for kids because you've got transformers and this incredible stuff. And meanwhile, real robots just don't, at least most of them, don't really do anything. You're talking about Roomba, et cetera. So the gap between the Hollywood version of robots and the prosaic reality was such that it was really hard to get them excited. So I thought, ""Well, what would be cooler than a rolling robot?"" I thought, ""A flying robot."" And so I'm like ... I don't think I actually know what a flying robot is. Astroboy or something, I'm not sure. So I literally googled, ""Flying robot,"" and the first result was drone. And I was like, ""Huh, I hadn't thought about it. I guess a drone is a flying robot. Wait, what's a drone?"" So I googled, ""Drone,"" and a drone is like a- Wait, what year is this? This is hard to imagine. This is '96, '97. Got it. Sorry, sorry. 2006, 2007. Okay, wow. So drones are not in the zeitgeist yet. No. Well, drones were in the zeitgeist as a military thing. I see. Right. But there were no consumer drones. You couldn't buy one. I know, I know. It seems so crazy now, but at the time, drones were like a predator that shot hellfire missiles, et cetera. It was really a purely military thing. Right. So I googled, ""What's a drone?"" And a drone's basically a plane with a brain. It had an autopilot. And I'm like, ""Okay. Wait. What's an autopilot?"" And you googled the autopilot and it's basically sensors and compute and it figures out which way is down, which way is up, GPS, et cetera. Those sensors and that compute, that's kind of what we have here in the Lego Mindstorms box, which came with accelerometer and magnetometer and gyro, et cetera. 
And I was like, ""Let's just do it."" And so, right on the dining room table, we built an autopilot out of Lego, stuck it in a radio-controlled airplane and it kind of almost worked. And the kids thought that was mildly amusing for about a minute and I was blown away. I was like, ""What just happened? Did we really just build a drone with children on the dining room table out of Lego and it worked?"" Wait. Can I ask you a very ... Just having messed around with drones quite a bit, I feel like you're skipping over the part where the thing keeps crashing and breaking and then you spend an hour putting it back together and it crashes and breaks again. It's something maddening, right? Oh, no. Yeah, yeah. I just told you the bit that got me excited, that put the idea in my brain. The next five years were just horrible, but I couldn't let it ... So basically, what had happened in 2007 was a bunch of things that, in retrospect, seem obvious, but in 2007, it was the beginning of the maker movement. So it was 3D printing, it was Arduino came out, but what it really was was the launch of the iPhone, 2007. So what's in an iPhone? A bunch of things, but including our MEMS sensors, these sensors that were chips. And previous sensors were mechanical. A gyro was literally a mechanical gyroscope and it was just unaffordable, unattainable. And so I call this the peace dividend of the smartphone wars, but basically the components of an iPhone had now been so cheap and available that you could then put them together in different ways and explore adjacent space. So a Fitbit ... Well, the Wii controller, for example, was an accelerometer, a MEMS accelerometer. The Fitbit guys got a Wii controller. And just like I got a Mindstorms set, ""Huh, what else could I do?"" They got a Wii controller, opened it up, saw the accelerometer and thought, ""What else can we do?"" And they came up with Fitbit. And so there's a bunch of people who were looking at the components that came out of smartphones and thinking, ""How do I recombine them to create something new and transform an industry?"" And so that's what we did. We basically, rather than drones which had been aerospace-grade stuff, so you'd basically take an airplane, subtract the pilot, we're like, ""Take a smartphone and add wings."" And that bottoms-up approach was completely radical and transformative and initially was horrible. I mean, nothing worked. They crashed all the time, but because they were small and foam and cheap, nobody got hurt. Right. And because they were small and foam and cheap, we could actually build a community and we got tens of thousands of people contributing and beta testing for all the right reasons. And we innovated, collectively as a community, innovated super fast so that we went from Lego, to foam, to plastic, to basically dominating the drone world, including becoming the biggest drone producer in North America five years after that with no funding. That all just happened and it just kind of exploded out of nowhere. It was kind of like the way the internet took over the telecom sector or PCs took over compute. This is a bunch of amateurs with open source software and hacked together stuff, basically took over the future of aerospace with classic Gandhi stuff. First they ignored, then they laughed, then they fought, then they lost. 
And today, it's pretty evident that the future of aerospace looks unmanned, it looks electric, it looks more like Silicon Valley than it does like Boeing or Airbus. Just like SpaceX did to the United Launch Alliance, kind of the tech, Silicon Valley drone model seems to be the future of aviation everywhere. That's so cool. And then, wait, how did you get into racing autonomous robots? Right, right. Okay. I started with the hobby, then industrialized my hobby. The drone community turns into a company, the company gets big, and now I'm running a company, which is all well and good, but again, still a nerd. Still wanting to get my hands dirty. Drones, at this point, this is now 10 years on, so this is ... What year are we in? This is 2017 or so. So at this point, drones are kind of a solved problem. It was really hard for a while, the Kalman filters and building robust, reliable systems and connecting to the internet and the data payloads and the computer vision, all that kind of stuff. It was really hard for a while, but now it's kind of solved. And I'm always looking for some unsolved problem, something that's challenging. And you would think that drones, as a 3D problem, would be harder than cars as a 2D problem, but they're not. And the reason is that you can get away with all sorts of slop up there in the air. The air is largely empty. You have GPS. And so we didn't really care whether we were a meter off. We basically just had GPS. Position and pose were kind of given to us. Pose was given to us by the autopilot. It was hard to get there, a lot of work to figure out where down is in an inertial frame. And then position is given to us by GPS, just for free. But you can't assume that you have GPS with a car. You often don't. So you need to establish position some other way, and also you need a level of precision that's a centimeter or less because there's lots of clutter on the street. And so it basically became a computer vision problem. So drones were an inertial problem, basically. Cars are a computer vision, deep learning problem. And computer vision, deep learning was just less advanced than classic control theory. So it was an opportunity to go deep on computer vision and deep learning and kind of get my brain going again. And once again, DIY drones led to an industry. So we're like, ""What should we call it? DIY robocars,"" because, I don't know, we hadn't figured out a name for autonomous cars yet. I went with robocars. ""And let's do it again. This time, I'm not going to screw up my hobby by turning it into a company. I'm just going to leave it a hobby, but let's get this flywheel going again."" And once again, we had the enabling technologies, which were finally ready. We had good compute in forms like ... We started with Raspberry Pi Threes and Fours and then Jetson Nanos. The job is to keep it affordable, democratize the technology. So we put a limit of $400. Nothing could cost more than $400. This is kind of what it looks like. Oh, no way. Yeah. This is a variant of it, but this is just basically an RC car chassis with a ... This one happens to be a Raspberry Pi Four on the top. Oh, nice. And a camera. This one also happens to have an Intel RealSense, a T265, which I'm playing with right now, but basically that's all you need. You need a camera, you need a Raspberry Pi, and you need an RC car. Ooh, can I show you ... I've got one that I ... Well, this is not yours, but I made one that's kind of similar. Yeah. It's a Raspberry Pi Three here with a similar camera.
I guess the chassis is a lot crappier than your chassis there. Well, this one's actually not that good. I probably fiddled with it a lot. I added a wheel encoder. Sorry, to nerd out a little bit, the Intel RealSense T265 is a really interesting sensor. It gives you basically position. It's a visual slam sensor, so it gives you position, but it's a lot better when it has an encoder, when it's matched with an encoder. It actually knows where it is. So it's doing it all visually with IMU and stereo vision, et cetera, and it records what it sees and then records that as you follow a path and then tries to replicate it again, all visually, but it can tend to drift over time. Wait. Can I see that sensor again? Yeah. It's this one right there. So that's two cameras? It's two cameras and an IMU. What's an IMU? An inertial measurement unit. Oh, I see. It's a combination of accelerometers, gyros, and a magnetometer that gives you a position. Gotcha. So your phone has one in it. So this one's using a framework called Donkey. So DIY Robocars is the community, but the actual project that is mostly used it called the Donkey Car. I would call it an MVP of self-driving cars, which is it's end-to-end deep learning and it works in the real world, it works in simulation. The basic model is behavioral cloning. So what you do is you drive with the PlayStation controller. You drive it around a track and it records the video, samples the video with stills as it goes around and then matches those with the inputs from your controller. And so you now have a pair. You have, basically, here's what this camera saw and here's what the driver did. And you send them out to the cloud and you run TensorFlow or whatever Fast.ai, whatever you're using, and you come back with a model inference layer. And then the model runs locally. So we train in the cloud or on your PC and you create a model, then the model runs locally, and then you switch into auto mode, and then it drives by itself by simply doing what you did, more or less, in the training session. So you just drive around three or four laps, maybe go clockwise, counterclockwise, little domain randomization. And it should learn how to drive. Now, that's one technique. In the physical world, that's the easiest way to train it. In our virtual environment, in our simulator, you can use different methods. So there, we use things like reinforcement learning and we give it reward functions and all that sort of thing, where during COVID, it really pushed it towards simulation. The exact same code works. It doesn't have to be on a physical card. It'll work on your laptop and you're running in a Unity-based simulator. And so it's been a really good time for us to push hard on our simulation side of the equation. And one of the questions we'll have, as COVID ends and we return to physical races, is how well do our models translate to the real world? Our sim to real gap. Yeah. Always a challenge. Exactly. And so we're working pretty closely with Unity right now to try to figure out how to improve the probability that our simulated-created models will translate well. And so we think a lot about domain randomization, but one thing ... It's hard to remember, but this car, that camera's 12 inches off the ground. Try putting your head 12 inches off the ground and try to see whether you can detect anything. Everything's so distorted and reflections and shadows, it's really hard to see the world from there. And so what we're trying to do is we're trying to ... Simulators are too perfect. 
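For readers who want to see the shape of the behavioral-cloning recipe described above, here is a rough sketch, not the actual Donkey Car code: it assumes you already have recorded camera stills paired with the driver's steering and throttle, trains a small convolutional network on them, and then maps new frames straight to controls.

```python
# Hedged sketch of behavioral cloning, not the real Donkey Car implementation.
# Assumes frames.npy holds recorded camera stills (N, 120, 160, 3) and
# controls.npy the matching (steering, throttle) pairs from the human driver.
import numpy as np
import tensorflow as tf

frames = np.load('frames.npy') / 255.0      # what the camera saw
controls = np.load('controls.npy')          # what the driver did

model = tf.keras.Sequential([
    tf.keras.Input(shape=frames.shape[1:]),
    tf.keras.layers.Conv2D(24, 5, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(32, 5, strides=2, activation='relu'),
    tf.keras.layers.Conv2D(64, 3, strides=2, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(2),               # steering, throttle
])
model.compile(optimizer='adam', loss='mse')

# Train in the cloud or on a PC...
model.fit(frames, controls, epochs=10, batch_size=64, validation_split=0.1)

# ...then the model runs locally on the car: pixels in, commands out.
steering, throttle = model.predict(frames[:1])[0]
```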
It's perfect information. We can create any level of resolution. They don't have motion blurred. So we're actually trying to make this simulator worse. And one of the problems we have here is that you'll train on the track and, on your own, it works great and then, during the race, the crowd comes and now you have spectators all around the track. And now, you have all these legs and it completely throws off the model. And so we're actually modeling people and randomly putting people around the track to train model to ignore that. And we're trying to figure out, what is it really looking at? Which color channel, what contrast? What do shadows do? And we're trying to understand better how to robustify the model to do this into real well. Man, what a cool project. I have so many questions. Is it in the scope of Unity? I should probably know this, but I really just don't. So I think of Unity as a graphics company. Does their engine also model physics? Yeah. They've really ramped up the robotics side. So you think of them as a game engine. And of course, they're good at that. Competing with Unreal, they're kind of open source and Unreal is less so, is not, I guess, but they're really pushing the robotics side. And yes, they use physics. So they use the Nvidia physics engine in the background. Cool. And so it's quite good. And they have a whole team right now focused on robotics. They were initially focused on things like segmentation classification. So let's say, for example, you want to model a factory or a warehouse or the shelves of the 711, et cetera. How do you identify an object that's a carton of milk? rotated, bad lighting, how do you make sure you can identify it well? And so they focus a lot on that, just sort of taking objects and then sticking them in virtual environments and just creating a lot of noise and train the system to understand that. They're also used a lot in full-size self-driving cars because they create beautiful photorealistic environments and that's important, as well, but what we're working on with them is video. I mean, yes, we screen grab the video, but the image moves. And so there's a correlation between the previous image and then the next image. So that includes things like motion blur because our cars go really fast. They go probably 20, 30 miles an hour, but scale speed is 150 miles an hour. And when your camera's a foot off the ground, it is a lot of motion blur and things like that. So we're starting to model that. We want to procedurally generate tracks so that we can do domain randomization with tracks, make sure to give the tracks certain parameters that at least don't break the physics. So one thing you could do is you could create a virtual model that can handle any track, but in the real world, you've got things like physics, like the traction budget of your wheels, et cetera. So we have to model at least some physics of the tracks realistic. And basically, your training, you want to be able to say, ""Here's my model, here's my code, here's my hyper-parameters,"" whatever, stick it into the simulator, ideally in headless mode, so just running in the cloud, and I want you to run a thousand iterations and then I want to turn randomization on. So I want you to do a thousand iterations of randomizing lighting, shadows, motion blur, objects that are surrounding, textures. I want you to go through, randomize the courses, as well. I want you to go clockwise and counterclockwise. I want you to change which track you're in at any point. 
Then I want you to add other cars that are also random. And so when you think about that, when you think about the industrial scale of just scenarios you can create, it gets really exciting. And so that's where Unity is focused right now. Cool. What's your hope for this? Is it the joy of making something or is there- As you know, one of the rules of the maker movement is you never ask why because the answer is always because we can. My personal thing is that it's just really engaging. It gives me a reason to explore the cutting edge of machine learning and data science and things like that. So I need a reason. I'm like probably you. I can only learn by doing and it gives me a reason to do it. As a community, our nominal reason is to democratize the technology, to basically ... I don't have a real self-driving car. You probably don't have a real self-driving car. And that ain't right. Man, well said. I love it. Yeah. So how do we make it so that more people can engage with self-driving cars without working for Google or Waymo or whatever? And the answer is you take the essence and you reduce it to a unit that anybody can have access to, exactly as we did with drones. I didn't have a Predator, so I made one out of Lego and foam. And I didn't have a self-driving car, so I made one out of toy parts and a Raspberry Pi. And so what you're seeing is this incredible diversity of people who are engaged. We have virtual races every month. Two races ago, the number one winner was Japanese ... I don't know what he does, but let's imagine just Japanese engineer. Number two was French teenager. Number three was a 12-year-old Indian girl from Canada. And then down the line are University of San Diego professors, retired people. It's just incredible diversity of people who can participate because, if you do it virtually, it doesn't cost anything. It's just download some code and run it. And so we're really feeling like we're opening up the excitement of the industry to people who, otherwise, wouldn't have access to it. And some of them are doing it for fun, some of them are doing it to get smart on ... a tangible reason to learn machine learning, and some of them are doing it because they want it to be their next career. So we find we have a lot of people who are mid-career. They're an engineer, whatever, they've got a job, but it's not exciting for them. And this is super exciting. And so it gives them the chance to sort of fall in love with tech again. And what are the axis that you can change stuff? I think one of the challenges with these simulations is it kind of constrains the hardware a bit. Doesn't it? How do you think about that? The axis that we don't really mess with are things like cost and danger. So we like to keep them small, we like to keep them cheap. I mean, there's some exceptions and I can get into that later, but by enlarge, it should be something you can do indoors, it should be something that, if it goes wrong, nobody gets hurt. So that's where we limit. Beyond that, there really aren't any constraints. So for example, there are a lot of ways via self-driving cars. There are a lot of sensors that are available. So one of the things that's gotten super interesting of late is that 2D LiDAR has gotten really cheap. I have one of those, yeah. So you can get 2D LiDAR now for about 80 bucks in a range of about 10 to 12 meters. So we can explore that. 
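One simple thing a cheap 2D scan like that enables is reactive obstacle avoidance. The sketch below is a hypothetical illustration: it assumes you already have a 360-element array of range readings (one per degree, index 0 straight ahead) from whatever LiDAR driver you use, and the angle convention and thresholds are assumptions, not values from the races.

```python
# Hypothetical reactive obstacle avoidance from a 2D LiDAR scan.
# `ranges` is assumed to be 360 distances in meters, one per degree,
# with index 0 pointing straight ahead; that convention is an assumption.
import numpy as np

def steer_away_from_obstacles(ranges, stop_dist=0.3, avoid_dist=1.0):
    ranges = np.asarray(ranges)
    ahead = np.concatenate([ranges[-30:], ranges[:30]])    # +/- 30 degrees forward
    if ahead.min() < stop_dist:
        return {'throttle': 0.0, 'steering': 0.0}           # emergency stop
    if ahead.min() < avoid_dist:
        left = ranges[30:90].mean()                          # clearance to the left
        right = ranges[270:330].mean()                       # clearance to the right
        return {'throttle': 0.3, 'steering': 0.5 if left > right else -0.5}
    return {'throttle': 0.6, 'steering': 0.0}                # path is clear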
Right now, we just used LiDAR for obstacle avoidance because our courses don't have a lot of structure and they're basically just white lines on carpet or on the pavements. I should you the RealSense, the sensors. This particular one was position, but they also have one that's depth sensing, which is useful for, again, obstacle avoidance. Sorry. What is step sensing? Sorry, depth sensing. Oh, depth sensing. Depth sensing. Forgive me. Another one is that we can actually go outdoors and use a drone autopilot on a car and simply navigate by GPS, alone. Now, GPS is not high enough resolution, but now RTK GPS, which uses a base station and a moving one, is quite affordable and can get you centimeter-level resolution. So this one here matches another GPS that's a base station that you have, locally. It's interesting. But you're not using any sonar anywhere, huh? Is it- Sonar's really not useful for us. ... too unreliable? There used to be something called the SparkFun Autonomous Vehicle Competition, which is no longer around. And that one was outdoors. And people originally used sonar to do things like avoid their hay bales on the side, et cetera. Yeah. Right, right. Very noisy. So there is not a sensor that exists that we haven't explored. So yes, we had sonar, but then we would create sonar rays- Whoa. Cool. ... of 360-degree sonar. Nice. Then, of course, the sonar's really old school, but the more recent ones are these time-of-flight sensors, these little, tiny time-of-flight sensors. So this one actually was just to compare sonar with time-of-flight sensing. What's time-of-flight? Is that LiDAR? It's like LiDAR. It shines a light beam out and then measures the time it takes to come back. I see. So basically, sonar is quite a wide beam and very noisy. The environment can obstruct. Time-of-flight is much better and cheaper and smaller, et cetera. What about radar? We have radar, as well. Radar is still relatively expensive. Also, radar tends to be relatively broad beam and that's not a problem. So if you're in a full-sized car and you want to detect the car in front of you, it's fine for that, but we have other ways to do it, cheaper ways to do it, time-of-flight, for example, because remember, our distances are a couple meters, not tens of meters. So we don't have any need for radar because we can solve it with time-of-flight. Then we have solid-state LiDAR, which, again, is affordable and, mechanically, a little simpler. We do a lot of crashing, so mechanical robustness is a good thing. The spinning LiDAR I just showed you is basically a 2D, plainer one. The solid-state LiDAR has kind of a wedge shape. And so you get a little bit more structure that way, but again, the depth-sensing cameras can give you much of the same information and they also give you sort of visual texture information, which is useful on top of that. I'm trying to think what other sensors we play with. Oh, there's a really smart one. So you can do a lot with cameras and one of the winners uses ... So most of these cameras, as you saw, are looking out, looking forward and a little bit down. And we're racing indoors. So what people realized is that, if you know where you are on the track, you have a huge advantage because you know where the curves are. You can go fast on the straightaways and slow on the curves. Basically, you have foresight into what's going to happen. So how do you localize on an indoor track? We have cones at the corners to detect when people are disqualified. 
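Since the cones come up next as a localization cue, here is a hedged sketch of how you might pick out orange cones optically with plain OpenCV; the HSV thresholds and minimum contour area are illustrative guesses, not values from the actual races.

```python
# Hedged sketch: detect orange cones in a single camera frame with OpenCV.
# The HSV thresholds and minimum area are illustrative, not tuned race values.
import cv2
import numpy as np

def find_cones(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Rough band for traffic-cone orange.
    mask = cv2.inRange(hsv, np.array([5, 120, 120]), np.array([20, 255, 255]))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))

    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cones = []
    for c in contours:
        if cv2.contourArea(c) > 100:                  # skip tiny specks of noise
            x, y, w, h = cv2.boundingRect(c)
            cones.append((x + w // 2, y + h))         # bottom-center pixel of the cone
    return cones
```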
And so people realized the cones were sort of a signature, a fingerprint, if you will, for the track. And so they would use LiDAR to identify the cones. Now, you can do it optically, as well, because the cones are orange. And so they would basically localize that way. And then a genius guy named Andy Sloan realized that there's another fingerprint of the track, of the course, which is that the lights on the ceiling had a distinctive pattern. And so his car actually has a fisheye lens and the camera looks up, as a fisheye lens, and it can see around it a little bit, but it also sees the ceiling. And it basically just steers by looking at the lights above it, which is absolutely brilliant. And you don't consider that cheating? Just any way to hack you- It works great indoors, but now we make them go outdoors, as well. I see. Nice. And so it'll fail outdoors. We do races in a place called Circuit Launch in Oakland, near the airport. And they just renovated it during COVID and they changed the lights. But yeah, so every trick you can think of. So it's called cone slam, by the way. Cone slam. Simultaneous localization and mapping. So cone slam and light slam. Anyway, I could go down the rabbit hole, but I just wanted to say that we do racing, which is largely about going fast and beating other people, but there are also ways to explore self-driving cars at tiny scale in a city environment. This is one cute version of it. Actually, I'm trying to remember, actually, what it's called. We'll put it in the show notes afterwards, but things like this use cameras and little Raspberry Pis. And it's called a Zoomy. There, it just told me. Nice. And you can build a Lego-sized city with stop signs and street corners, et cetera. You can go to IKEA and get these kids' carpets that have cities for toy cars, et cetera, and you could actually run one of these on it and it'll navigate the city. So these things are super ... they use Jupyter Notebooks and Python and they're really fun and easy and super cute. You don't have to race to be able to participate. Its eyes are so evocative, too. I love it. They are. Yeah. It just said, ""Find Zoomy on your wifi,"" and then if you go there, it runs a little web server and it's running a Jupyter Notebook and you can do things like drive it around the town. What are the people that are winning these things focused on? Is it actually knowing your position and orientation really accurately or is it sort of strategizing your path through the course? What's the challenge? All of the above. It's things like racing lines, which is finding ... Basically, racing lines are the shortest path around the track, and going fast in straightaways and then braking at the right time, the classic racing stuff. Localization helps a lot. It allows you to create a strategy. Then there's passing strategies and avoidance strategies and how do you win when you're going head to head, as they always are? Is drafting relevant at these low speeds? No, it's not. It's not. Yeah. Yeah, it's going 20 miles an hour, but they're small. The biggest challenge, though, and this is one that does not show up a lot in real self-driving cars, is we're going freaking fast. So 20 miles an hour in a one-tenth scale car, that's 200 miles an hour. And so this is realtime robotics. And I don't know how much time you've spent with realtime robotics, but 20 milliseconds is slow. And so our inner loops could be running at a thousand hertz. So you do inference at 20 milliseconds on a Raspberry Pi Three? Depends.
So no, we're not doing 20 milliseconds on Raspberry Pi Three, but we can do 100 milliseconds on a Raspberry Pi Four. Right, right. That's sort of your AI loop. Then you might have a motor controller loop that's running faster if you're running an IMU or essentially you might be detecting. The IMUs, we're just getting the inertial measurements, would be detecting something like drifting. So if you're supposed to be going straight and you actually have some lateral movement, that means that your tires have lost traction and you're skidding. So how do we do it real time? And the answer ... So you need at least, I would say, 30 frames per second, at least 30 frames a second. Real cars are not sampling that fast. And if you're going 30 frames a second, you may have to make some concessions. So first of all, our cameras are relatively low res, so we're running at 320. And our models are pretty simplified, might have three or four layers, but no more than that. We're not running a lot of models simultaneously, so it's end-to-end neural networks. So basically, it's just pixels come in and commands to the steering go out. So we're not running parallel networks, et cetera. But yeah, these are all great challenges. If you tell somebody, ""Keep it under $400 and win,"" it requires a lot of creative thinking about that. And you can't just throw compute at it. It's not okay to show up with the kind of stuff you'll find in the trunk of a Waymo. That's cheating. You show up with your Jetson Nano or your Raspberry Pi Four and then you use some creative algorithm or technique to win. And that's the fun. Yeah. That's so fun. I mean, just a Nano or even Raspberry Pi Four, that's not joke these days, though. It's funny. To me, it's just amazing what we can do. Yeah. I don't know. The Nano right now is 60 bucks or something, the two-gigabyte one, and the Raspberry Pi Four is about the same. So it's really great, but what's really important is the software frameworks now support them. So TensorRt, TensorFlow RT, keras, Fast.ai, they're all starting to think about edge compute. I just want to say, they've put in so much effort and they're so friendly. I feel like, when I've asked questions, they've just been unbelievably helpful. So I don't know, I feel like I just need to give them a thank you for that. Absolutely. And everyone's doing it. So Nvidia, obviously, they didn't have to come out with a Jetson that cost 59 bucks, but they did. Amazon's set up the RoboMaker, which is their virtual environment for this. Microsoft is investing a huge amount into edge AI. The Intel RealSense I just told you about, Raspberry Pi, et cetera, all the Google stuff is focused on edge AI, as well. So the notion that the edge ... So the cloud, the core is one thing, but the edge is completely different in that you have real-world inputs, realtime inputs, realtime outputs. And they tend to be small, cheap, power-efficient, et cetera. And so you realize that the internet has always been this way, that it's a combination of the edge and the core and that it shifts. Where's the thinking done? Where's the intelligence? And it's going to be some balance. We got the cloud, we got the core down right, but the edge is an opportunity to basically pre-process a lot of data before you get it. Because we can gather so much data, if we can pre-process it with deep learning and the edge, it actually makes the core smarter, as well. Totally. 
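As a concrete example of the edge-compute point, here is a hedged sketch of shrinking a trained Keras pilot model with TensorFlow Lite and timing a single inference against the roughly 30-frames-per-second budget mentioned above; the model file name is a placeholder, not a file from the actual project.

```python
# Hedged sketch: convert a trained pilot model to TensorFlow Lite and time it.
# 'pilot.h5' is a placeholder name, not a file from the actual project.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('pilot.h5')

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]      # size/latency optimizations
open('pilot.tflite', 'wb').write(converter.convert())

interpreter = tf.lite.Interpreter(model_path='pilot.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

frame = np.zeros(inp['shape'], dtype=np.float32)           # dummy camera frame
start = time.perf_counter()
interpreter.set_tensor(inp['index'], frame)
interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) * 1000
print(f'{elapsed_ms:.1f} ms per frame ->', interpreter.get_tensor(out['index']))
```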
So it's really exciting, what's happening right now, not only with deep learning, but also computer vision. A big fan of our project called Open MV, which is basically ... It looks just like one of these cameras, actually. So we've been talking a lot about deep learning, but computer vision is equally exciting. This is an Open MV and it's basically, again, a $50 board, but it's camera and it's got compute onboard and it's basically running Open CV. And it runs it really well with a Python interface, a fantastic IDE, and you basically just stick this on anything. It can run a car just all by itself. And now you've got the stuff that was like a PhD 10 years ago of edge detection and some simple deep learning networks, object detection, all sorts of transforms, et cetera, are all just built in, already built into this thing. And any kid can now use this to do sophisticated computer vision. So actually, cars that use nothing more than this have consistently scored in the top 10. Wow. And you can literally make a self-driving racing car for less than $100 with something like this. So cool. Before I let you go, I'd love to ask you a couple broader questions. I think you watched the Peter Norvig episode and I was really curious to ask him this. You're here, too, as someone who's been watching machine learning for longer than most. And I'm really curious what your perspective is, having sort of seen a long arc of this stuff. I guess everyone must ask these questions, so I feel a little shy asking them, but I'm really curious what you think. When do you think we'll see, for example, autonomous cars working in our life at all times? And where do you think this goes? Do you feel like there's probably fundamental limitations to what we're doing with neural networks now or do you feel like just kind of scaling up what we have leads to singularity-like outcomes? Everything I know about deep learning I probably learned from listening to your podcast because I'm dabbling. Peter Norvig's a legend, but I- But you were training neural nets back in grad school, no? Yeah, but these were hopfield nets. And we hadn't really figured out the whole notion of layers and convolution and all this kind of stuff. So there was a real dead end and it was very frustrating. So look, with drones, once we got one drone to fly, I was like, ""The sky's going to be dark with these things."" They're essentially free. It's done. Think of how great it would be to have total information awareness of our planet. Rather than waiting for the satellites to come by or for the clouds to clear or having cameras in every stoplight, what if we could just sort of have a camera anywhere, anytime, to measure our planet so we could manage it better? So it seemed to be obvious that the missing middle, if you will ... We had cameras on the ground and we had cameras in space and the missing middle was the air, which is an opportunity to be anywhere, anytime, higher resolution. Just seemed like a good thing to instrument our planet, and yet here we are. There's nothing in the air. I can't believe it. It's been 15 years and we still don't have sky dark with these things. We really don't have any autonomous drones at all in operation, except for the military, like we had back then. So what happened? Well, the problem wasn't technical. The problem was regulatory. It is the FAA will not allow drones to fly beyond vision line of sight, won't allow them to fly without one-to-one pilot with sticks, like an animal. 
Basically, the FAA will not allow drones to be autonomous. It won't allow us to break the one-to-one ratio, which we've achieved nothing, in a sense. Imagine a robot that could only work tele-operated. What have you achieved? You still have one person, one robot, and that's where we are. Drones essentially have to be tele-operated or at least have someone monitoring autonomous operations, which is even worse because now they're not doing anything. So that was disappointing. It was disappointing for a regulatory reason. And I can understand it and I work with the FAA pretty closely on trying to resolve it, but the question about cars is more about society and regulation than it is about the cars. Can cars be autonomous today? Yes. Can they be autonomous everywhere perfectly? No. Should it be okay for cars to be deployed autonomously in some places where they can be highly reliable, but not everywhere? Absolutely. And companies like Voyage are doing that with retirement communities, closed courses, if you will. So I think the question is, are drones used today autonomously? Yes. Are they overhead right now? No. Am I disappointed there aren't more of them? Yes, but obviously they go where they're needed most. And I presume that self-driving cars ... I think we're setting the wrong standard. Should we have self-driving Ubers in all cities? Probably not. There's not a lot of advantage to it. Waymo's doing a little bit in Arizona, but that's probably not a game changer. Where would self-driving cars be a game changer? I think, actually, retirement communities are a really good example. They're quite empowering and liberating for people. So I think, if you reset and say, as the technology gets better, will we identify really useful places just where it wants to be and focus less on the tech and more about the marketplaces and the demand? Will we find those places? And the answer is yes. And so I think that all the questions about when self-driving cars come, they all kind of come from a technology place. And I think we're in our Silicon Valley bubble. We really need to understand the needs, the use cases, the places that would benefit most from them and think less about the tech and more about how it's going to be used. Interesting. Interesting perspective. Thanks. So there's two questions that we always end with. And the second to last one is, from your perspective, especially from drones and robots, what's one underrated aspect of machine learning that you think people should pay more attention to? I think I mentioned I'm really into simulation and synthetic data. And I know you had a couple episodes now on synthetic data creation, but I do think this is the golden age of simulation. I work really closely with Microsoft and, if you've used Microsoft Light Simulator 2020, which basically uses satellite and aerial data to recreate the entire planet, photo-realistically, with using photogrammetry to create 3D models of the planet, but realtime with weather and everything, as it really is. I think this is the golden age of simulation, the golden age of rendering that. And as a result, our opportunity to use these powerful engines to train models better ... We talked about domain randomization, we talked about synthetic data, but I'm most excited about that because I feel like we've kind of hit some limits in the ability of humans to train models. 
And even GPT-3 is limited, as you've mentioned before, by the amount of data on the internet, which sounds like a lot, but is never enough. And so I think that we need to think really hard about our synthetic data generation strategies so that we can break through the limits of real data and start training them on things that we can only imagine. Totally. Okay. And then the final question is, for you ... And you've actually built a pretty sophisticated end-to-end ML system now. What's the biggest challenge of getting that to work, or what's a challenge of getting that to work that people might not expect when you just sort of lay out what you're doing? First of all, I should say I did not build this. This is the Donkey Car team and there's a lot of people there who get credit for that. Tom Kramer was the originator of the current stack. First of all, one thing you should know about end-to-end is that it is end-to-end. All we have is one channel. Pictures come in and controls go out. We're blessed to have things like TensorFlow that'll do that, but once we start introducing other things like depth sensing and those other sensors we talk about, we're probably going to need to introduce multiple parallel networks running. Now, should the obstacle avoidance be also running on machine learning or should that be more classical control theory, if you will? How do we combine classic robotics control theory with deep learning? One's probabilistic, the other one's deterministic. How do we merge them? And so I think there's some interesting work to do to start to introduce multiple inputs. Right now, we have one input, one output, but of course in robotics, it's MIMO, multiple input, multiple output. And I think, if you stick to the $400 limit, being able to do multiple input and multiple output with deep learning in all these channels is super interesting. I don't know whether we're there yet, but that's sort of ... You asked, what have we learned? And we've learned that you can do one channel in one network pretty easily and it works amazingly well, but it doesn't scale to multiple inputs. And if you really want to start winning in competitive races with other cars and actually doing what a human would do in a race, we're going to need to bring in all the channels and sensors and data we can, and that means a different architecture. Although, the part of that car that's going to come down in cost is running the neural networks, right? I mean, I feel like that's the thing that seems to be dropping the fastest. Well, that is good news. If the Raspberry Pi 5 or the Jetson Xavier can do that, then yeah, maybe we can just apply our same technique and just say, okay, let's add another network to keep track of the other cars. Add a third network to keep track of the sliding, the friction, how the car's actually mechanically moving on the track with the IMU, and then find some way to merge them. That would be super exciting. To do the whole thing at 30, 50, 60 frames per second under $400, I don't think we're quite there yet, but you're right, that's going to be the focus over the next couple years. Awesome. Well, thanks so much. It's an honor to talk to you. That was so much fun. This was a pleasure. Thanks for listening to another episode of Gradient Dissent. Doing these interviews is a lot of fun and it's especially fun for me when I can actually hear from the people that are listening to the episodes. 
So if you wouldn't mind leaving a comment and telling me what you think or starting a conversation, that would make me inspired to do more of these episodes. And also, if you wouldn't mind liking and subscribing, I'd appreciate that a lot.",11414 +Adrien Treuille — Building Blazingly Fast Tools That People Love,https://www.youtube.com/watch?v=xApf15JyZYU,2737,2020-12-04,"The state of the art is often like, ""I'm going to send you an email and just do a one-off exploration in Jupyter notebook and tell me the answer and paste it into a PowerPoint presentation."" Like that's a lot of how the rest of the company interacts with the data science team and the machine learning team. And that's kind of insane. It's so inefficient. And so I think that the aspiration that I have for StreamLit it is that almost as a byproduct of existing workflows, the engineers working on those teams are empowered to sort of bring their work directly. Inject it into the entire company and allow the whole company to make decisions and predictions and stuff in the same way that they can. I think it would have a big, big, big impact, and it already is starting to. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Adrien is the CEO of Streamlit, and a good friend of mine. Before Streamlit, he founded Foldit, which is a famous crowdsourcing project that enlisted millions of gamers to solve real scientific challenges. He also served as AI Project Leader at Google X, and the VP of Simulation for the autonomous vehicle companies Zoox, and was an Assistant Professor of Computer Science at CMU. I can't wait to get to all these things with them. I was kind of wondering how to do this, it doesn't feel like... It'll be just like talking to an old friend. But I think it's inevitable that that's what it's going to feel like. And I don't know exactly the best place to begin. But I thought it might be interesting for you to tell a little bit of the story of your career. Like I know that when you were younger, you're super into music, and you're a great guitar player, then I think you got into graphics? Now you're doing like a really interesting company. And you've done some deep learning. So how does it all fit together? Like what is the arc of Adrien's life? Yeah, well, the arc is that I keep changing what I'm doing. Half of the time, because I realized that there's something else even cooler that I want to do and the other half of the time because I realize I'm never going to be good at whatever it is I'm doing right now. So when I was in high school, I wanted to be a guitar player. And I ended up going to this like jazz club, those kind of really hot in the 90s in New York called Smalls and seeing this like totally epic young guitar player named Kurt Rosenwinkel, who became very famous, subsequently. But at the time was a little less well known. And I was like a high schooler who didn't shave or know anything. And I went up to him, and I was like, ""Excuse me like Mr. Rosenwinkel, would you please teach me how to play guitar?"" And he was like, ""Yeah come to my place. It's like in Brooklyn, like tomorrow or whatever."" So okay, so I go there, and he becomes my guitar teacher. And it was like, absolutely one of the most inspirational episodes in my life. Because here was someone who just lived in a musical dimension that I couldn't believe, basically. And I was so inspired every time I took lessons with him. 
And I was like, ""I can do this."" I even asked him, ""Do you think I could be a professional guitar player?"" He was like, ""Yeah, I think you could."" At one point I was like, ""Hey, do you think like.... How often do you practice?"" And he was like, ""About 12 hours a day."" And I was like, ""12 hours a day? Are you kidding me?"" And he's like, ""Yeah, I only practice when I feel like it."" And I was like, ""Oh, wow, I am not going to be a professional guitarist."" So that was me realizing I was not going to be a professional guitarist. And then I wanted to do international relations. And I became disillusioned with that. And I got into math. And I ended up becoming a professor at Carnegie Mellon and working on both basically machine learning problems and big data problems. And we had jobs running for hours, every single night and for days on end, actually. And that was really fun. I actually loved it. And we were using Python, NumPy. All these things that are now very much part of the Zeitgeist, we were using them, like pre1.0, when why would you use Python instead of a MATLAB or something like that in those days? And what were the... Before that you made Foldit? Right, which I think is one of the most interesting... And maybe do you want to talk about Foldit and what happened there? Yeah definitely. I think that was, if the guitar one was an example of me realizing I was never going to be that good at it. I think Foldit was an example of me seeing something else really cool and jumping at it. And so what happened was, I had been working on this numerical stuff. And then right at the end of my PhD, some basically, biochemical biochemistry professors and I, and they got this group together. And we had this idea of, let's create a computer game out of protein folding. And so it was... First of all it's a really easy first thing scientific question, because it just so happens that it's like very difficult to create simulations of protein folding, takes a lot of computational power to solve. It also is a problem with like enormous real world consequences. Because in short, proteins are these machines in your body that carry out the basic functions of life, their shape determines how they do that. And folding is how they get that shape. So understanding how proteins fold and why they fold into certain shapes is like literally like the origami of life itself. And so here is this super interesting scientific problem, very difficult to solve by computers, we had this line on this, like totally crazy, kind of fantastical take on it. Which was, let's turn it into a computer game which may or may not be fun, much less had any kind of scientific impact. And we just ran with it. And we did it. And it kind of blew up. Over a million people contributed to this really profound scientific problem all over the world. Some of the best Foldit players in the world, were people who scarcely thought they had a scientific bone in their body. And all of a sudden, they're at the top of the leaderboard. And the BBC is calling them up and asking them to interview them. This really happened. And so, wait, the game though... I mean, the game just because people may have trouble imagining this. I mean, I played this game, I was not good at it. But if it's like you're trying to rotate and manipulate these, like molecules, basically. Yeah, yeah. Why can a person do this but a computer can't do this. Yeah. Well, okay, if your initial intuition was the rules of why can you recognize a face? 
Well from a computer's perspective, it's like, super hard to recognize a face. You need this giant neural network, and you need to like measure all these things, and it can involve all these things until you really need like, in a sense, you might say millions of equations are stacking up in order to create this, like face recognizer. And yet, we can do it instantly. And similarly, in the case of a protein, technically speaking, there are this geometric number of pairwise atom interactions that are going into it, and these atoms they sometimes repel one another, they attract one another as the case may be. And so it creates this like, network of sort of attractive and repulsive magnets, basically, and then the ultimate shape is some kind of stasis. So you would look at that problem and think, like this crazy math equation to solve what it actually would look like. And yet, the scientists who work in this field develop an intuition that's very definite. And in fact, they could say, ""This looks like a real protein that we learned... We know it's shaped through a crystal structure."" Or, ""This looks like a protein that was designed badly by a computer."" So in essence, it had this similar flavor, which was that like, over time, you could actually build an intuition for what looks right, and what doesn't. And that was like the ur-idea that led us to believe that potentially that intuition could be essentially trained. Here, you're training humans, actually, through a game. Yeah, that's right. That's right. And that's actually a really fun process. And incidentally, the way that we train them, this actually gets into the game thing, is you actually build a simulation. Well, we had a simulation of how proteins fold, and then you let people play with it. And in essence proteins, they are physical objects, like they're a little different from the ones that we normally play with, because they're suspended in water and stuff. But if you pull on them, they resist in some places, and then they don't like to bang into themselves and stuff. And so as you play with them, and as you sort of flex them and pull on them, there is an intuition to the game about how these things work. It's like no different than playing with Play-Doh or Silly Putty, at some point, you start to understand the underlying material. And you don't have to... It's not completely new, when you press on it, what's going to happen. And so in a funny way... The long story short is that I think it was hailed as sort of a certainly a sort of milestone in attempting to build a giant, large scale sort of human computer, computational complex. And also, we were able to publish papers in Nature and in some cases, and in PNAS and other great news, with insights that have been derived from the players. So that was great. But for me, one of the most fun things was actually that phase of like, how can we build an actual game that gives millions of people the intuitive sense for what this thing is? And is it possible to hand them that and then have them sort of understand it and crack it? And then didn't you make a... You made a second game too, right, that actually- Yeah. Yeah we created a second game called EteRNA. And that also actually, we published. Similar idea. It was a scientific discovery game, we enlisted a bunch of people, both these games are going strong, actually. So you can play them both right now. 
And the real innovation in EteRNA is that rather than just do everything in simulation on a computer, we were actually using high-throughput synthesis to build the molecules being designed by the players. And so in essence, your score was determined by a tiny little high-throughput experiment that was run, which I just think is so cool. And a lot of interesting stuff comes out of that. And you don't need a simulation, for one thing. So how did it work? The players would propose molecules and then you would synthesize them? Yeah, they would propose molecules, and initially they would vote on them. The thing is that the cost of these experiments keeps going down. And so that actually means that the games are being designed against this, like, super Moore's Law kind of change in biochemistry in terms of, like, what is possible to synthesize, how fast, what kind of experiments can you run? What information do you get back? That's all shifting underneath the game, so we were actually redesigning the game over time as these things changed. But yeah, in essence, they would design them, they would vote on them, we would synthesize them, we would share the results with everyone, everyone would get a score, everyone would look at what everyone else's molecule did. And then- And what would the score come from? ... Rinse and repeat. So the score was... So, okay. First of all, rather than working on protein folding, we were working on RNA folding. And spoiler alert: COVID-19 is an RNA virus. And in fact, it turns out that Moderna, this company that's famously one of the contenders to create a vaccine for COVID-19, is an RNA research company. So as it turns out, RNAs have at least shoved aside, if not in some ways supplanted, proteins as a molecule of intense interest to biochemists and pharmaceutical companies. As a sort of chemical substrate with which to build a whole new class of drugs that could potentially, essentially enter your body and then interact on a super deep, like, quasi-computational level with what it sees, in a chemical sense. So we were using RNAs this time round, which have slightly different properties than proteins; they tend to be bigger in some ways, and they're bendier, they're more flexible. And so what we would do is, we would say, ""Try to create an RNA that folds into this particular shape."" And initially, the shapes were essentially just things we invented, that we thought RNAs could plausibly fold into. And then over time, they became actually more pharmaceutically interesting. And in fact, the most recent challenges on EteRNA do have to do with COVID actually, directly. If this sounds intriguing to your listeners, I think it would be super cool if you guys take a look and play around with it. And it's very, very current, actually. But yeah, we gave them a shape. They were trying to build RNAs, in other words, design sequences of nucleotides, that would naturally fold into that shape. We would take the most highly voted molecules and synthesize them and then basically figure out what shape they folded into, which you can do. And then we would basically use a sort of root-mean-squared-error distance function, just like in machine learning, to tell you how close you were to the shape. And so the neural net, as it were, the black box, is the human mind. But other than that, it was just the same thing: a loss function and an input function. And then you just do this thing over and over again. 
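To make that scoring idea concrete: the game compares the shape a designed molecule actually folded into against the target shape using a root-mean-squared distance, exactly like a loss function. Here is a minimal Python sketch of that comparison, assuming the two shapes are given as already-aligned 3D coordinate arrays; the coordinates and the alignment step are simplifications for illustration, not EteRNA's actual pipeline.

```python
import numpy as np

def rmsd(designed: np.ndarray, target: np.ndarray) -> float:
    """Root-mean-squared distance between two (N, 3) coordinate arrays.

    Assumes the shapes are already aligned point-for-point; the real scoring
    pipeline (mapping experimental data back to a structure) is more involved.
    """
    diff = designed - target
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Hypothetical example: a 4-point target shape and a slightly perturbed design.
target = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])
designed = target + 0.1  # every coordinate off by 0.1
print(rmsd(designed, target))  # lower is better, exactly like a loss function
```

Lower values mean the design is closer to the target, which is what lets the leaderboard behave like a loss curve for the human players.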
And ideally, through some kind of human based, Gradient Dissent, little plug, in the community would improve. And lo and behold, they did. So that's just so cool. But I guess where are we at now? Because think about games like Go that are so well studied. And computers getting better than humans. Have artificial neural nets surpassed Gradient Dissent, human neural nets at this point? Yeah, yeah, yeah. So basically, yes, actually, in a way, but the other thing is that the game design shifts. And so it's just as... It's similar to the real world. Yes, we have better neural nets, but it doesn't mean that we're all out of a job yet. If anything, it means that new jobs are being defined. And so and if that sounds glib, it actually did kind of happen that way into like microcosm of these games. Which is to say that like, it tended to be the case that felt like raw let me beat a computer at this task was not the most interesting thing that came out of it. And in any case, it was a moving target because there is a universe of researchers trying to create better and better algorithms and using neural nets for that matter, and a lot of other statistical stuff. And so that boundary kept shifting. But in my opinion, the really interesting thing about having a large number of humans like in playing this game, and basically talking about it on the forums and sort of creating a community around it, was that they ultimately came up with interesting ideas and shifted the game design and sort of did this like human element. Which is like, what other interesting stuff can we do here. So for example, at one point, some of the players in EteRNA basically noticed that certain motifs, like putting certain nucleotides in certain patterns was more likely to create a stable RNA. And this is just like a purely human thing. It wasn't something that we were like necessarily looking for. And then we were able to like basically rigorously prove they were right. And that's starting to cross into like science, basically. And so just we have not automated that yet. So those are the kinds of things that I think ultimately, to me, are the like, more important outcome. Rather than just like, we temporarily beat the best computer algorithm in 2003 at this very specific computational task, which was ultimately not going to be a winning formula. When you became a professor, what was your research areas? What were you interested in? Yeah, there was always these two pieces of it, which were, if not... They weren't really in conflict, but they didn't really connect. So one of them was creating computer games, we actually created a bunch that did all kinds of interesting things. We created computer games that like, allowed us to, like capture a ton of information with my student, Alex [Linpacker 00:16:39]. Tons of information about how artists draw faces. And we actually put out a game on iOS where you could like draw celebrity faces and then try and guess a celebrity. It was like, based on Draw It game, or something, I forget what it was called. Got a bunch of people to play that. And so we were literally like paying Google AdWords get people to play our games to create these like esoteric scientific datasets to study these like recondite questions, which is so cool. Such a just a weird thing to do, I guess. And pretty weird compared to the other professors. And at the same time, we were also like writing papers on basically applying machine learning methods to crazy graphics problems. 
We were like applying machine learning to smoke simulations, and to the light transport equations. We were running... It was like whiteboards full of equations. And it was running jobs on clusters that literally took days or weeks to run. And so we were doing hyper-parameter searches, and all this stuff that's now suddenly cool. So those are the sort of dual worlds that I guess, probably similar to you, there's always been this pull towards: on the one hand, like, math, and just the, like, austere perfection you find in that. And then also just creating things that people want to play with, or use, and sort of the delight in, like, creating products, basically. Well, I want to jump ahead to Streamlit to give it the time it deserves. But we are skipping over a whole bunch of other amazing things that you did. But I'd love to hear the story, in your words, of coming up with Streamlit, because I feel like I watched some of it. And it appeared to me like it almost just like popped into your head as sort of a complete idea that was sort of like immediately awesome. So I'm curious to know what the experience was for you. Yeah. Well, it didn't quite happen that magically. It's funny that on the one hand, I was working on like machine learning problems and numerical math. And on the other hand, I wanted to build products for people and like build communities around those products. And weirdly, I feel like those two things have come together in this product called Streamlit. Basically, what Streamlit does is it lets machine learning engineers and data scientists build little interactive artifacts that allow them to share their data sets, their models, their predictions for the future, etc., with one another and inside of organizations and also with the world. It's an app library for Python programmers. And we can go into actually why it turns out that's actually a really important thing, both for like people who want to show off their skills, but also in big corporations that need to like export machine learning into the whole company. It turns out that they both need this sort of superpower that Streamlit provides. But how it came about was, I had worked on a project at Google that got canceled, very like heartbreakingly to me. And it was a very public failure, if you choose to look at it that way. In retrospect, like, all my failures were my successes. And all my successes were not necessarily successes. But it's kind of the story you're telling yourself at the time. And so I took it really hard, basically. And I then took a job that I wasn't like super excited about; in dating parlance, you might call it a rebound. And then I eventually basically took some time off. And I started just writing code, which had long been a passion. And a friend of mine named Lukas Biewald was like, ""Hey, dude, let's go into the woods to, like, an Airbnb, and we'll write code together."" Which I thought was the coolest idea. So we went to the woods, and we started writing neural nets, you and me. And that was one of several projects. I'd also been working on a stock market simulation, which actually also came out of a conversation with you, Lukas. Well, the funny story there is I'm like, ""Lukas, I think that like..."" I'm telling him over dinner. So I'm telling you some like statistical properties that I thought the stock market might have. And I remember you were just like, ""Adrien, do not invest on this assumption. People lose their shirts thinking stuff like this."" And I was actually like, really touched. 
I was like, ""First of all, I had no intention of actually investing."" And I was like, ""I'm not even like, well enough organized to do that."" But like, I was like, ""Wow."" Like kind of touched, like Lukas is like really looking out for me here on this math conversation, we're having. So, I was working on a bunch of fun products like that were kind of mathy. And that also you needed to be able to play with stuff and see it. And so basically, naturally coming out of that workflow I just was dissatisfied with everything else out there. So I started writing my own tools that allowed me to basically take Python scripts I was writing and turn them into little interactive artifacts that would allow me to like play with them and see their properties more tangibly. Than just like changing a number and rewriting the code or writing a loop and then running it 6,000 times. And that need kind of just snowballed. Like, I'm skipping the part where we had some heartbreaking pivots and stuff. And I'm happy to go into that too. But it is true that on some level I wanted it, a bunch of my friends wanted it. Some people who eventually became my co-workers were like, let's all work on this together. Some big companies started using it investors started- Wait, wait, wait, wait. Yeah. Don't skip the heartbreaking pivots. I mean, that's like why we do these things. Okay, okay, okay, yeah. Running into a wall at 90 miles an hour. Were their paths because it really seemed to me like, I feel like I watched you kind of come up with this idea that it seems like it's the idea that's the core of what you have now. I'm actually kind of surprised to learn that there was... Well, okay, yeah. So the pivot was, it all comes down to you throwing me off, first of all, showing me the way Lukas, and then throw me off my game. So what happened was, I started using Streamlit, as a way to understand these stock market simulations actually. And the key thing was that once you build this model, like you want to be able to change parameters really easily, and then see how that affects the model. And obviously, if the model's a straight line, it's not like super interesting. But when it's like... You get it. When it's something that's really just computations happening. And especially, it's like a non-trivial simulation of the future, there is crazy... I mean, it's one of the principles of like dynamical systems theory, like you can change a number a tiny little bit, and all of a sudden totally different things start to come out, and where are these bifurcations and stuff. And so it's really fascinating, and worlds get created. And that was the original version of StreamLit. And in fact, if anything, we've come back to that. But what happened was, you invited me to go out in the woods and code neural nets and stuff. And at the time, Weights & Biases was you were ahead of me, in the sense that I think you'd started a company already. But it was pretty like rough. Like you were like, ""Okay, let's use Weights & Biases for this project."" And in like, five minutes, you were like, ""It's not working, forget it. We're not using Weights & Biases."" And so yeah, so for all of you, people who think the Weights & Biases are so polished and perfect, I can remember days, it was still very much early. So anyway, we were doing this neural net stuff together. And I was like, ""Oh, this is really cool."" And I think I probably got like a little FOMO about, like how cool your incipient company was. 
And so I kind of started to work more on deep learning style applications for Streamlit. And when we initially fundraised, we sort of had a superposition of two products in some ways. And what basically happened is that we had some signal that was positive, like people were using it, and not just because we were bugging them every day. But we also just didn't have as much signal as we wanted. And we, in some parts of the company early on, we like... A company wanted us to install StreamLit internally, we put a ton of effort into it. And then like it was crickets after we did this, like big install for them. And so I remember talking to one of our investors who's like, super highly respected has been around the bush and he like, invited me for coffee and he was like, ""Adrien what are your milestones? Don't just send us investor updates, or like, we're still building."" And that's a little tough and we were still building and searching basically, and we were post fundraise, so it wasn't like, we were just totally in like bushwhacking exploration mode. Like there was some kind of clock that was ticking. So we actually wrote a huge slide deck, which was like everything that the product can become. And we shared it with everyone who was using Streamlit. And we basically gave them like an hour long interview. And we'd like data scienced the whole thing. And we were like ""How much would you want this feature, that feature, that feature."" We like clustered them and everything. And actually it so happened that really the thing that people were most excited about was also the thing that had been actually kind of the ur-original thought, which is that you want to, once you've built a model, or once you've bought a simulation, or once you've built actually even like a non-trivial data set, you want to be able to rapidly like interrogate it, potentially in sort of ad hoc manner. So you want like arbitrary code, and you want it to elegantly do that without... And that's a different product category than just like Tableau or something, it's a little bit more computational. And we realized we should make this an app framework, sort of basically, a Shiny for Python, and you just have sliders and widgets, and it needs to result in a webpage that's interactive. And I resisted it until I was worn down by my co-founders. And then we all just agreed to do it. And we just went long on that. And we launched it, and it found resonance, basically. So that's the story. Interesting. So what did you add to it to make it do that? Because I feel like when I first saw it, I thought, ""Oh, this is Shiny."" Oh, really? Well, you could have saved me six months, dude. Yeah, well, the basic thing was whether we are going to have widgets. And then the next thing, which is like yes, in React a widget is you just say there is a widget and suddenly it exists. So why is that so hard? One of the reasons why it's so hard is because if you really commit to like writing an app framework, then it implies a whole bunch of things down the line about how you'd expect the product to work. So it's not like lines of code for the prototype doesn't translate into like how easy it is to get to from product perspective. The other thing is that, and this starting to get a little nerdy, but there was a question of like the event model. And one of the things that makes... Actually the thing that... Why is it hard to make a little app around your machine learning model. 
Why can't you just whip together a little Flask app with a React front end, and it's like, boom, it's done. And basically, the reason is because... Well, actually it is because app programming is actually really hard. And the hard thing about is that you have these events coming in, there's a whole event model, and then there's a state model, and then these things need to not mess one another up. And it needs to always reflect things properly. And that turns out to be such a hard problem, even for humans to like wrap their head around that we're still seeing major advances every couple of years in terms of just from an API perspective. How to not make that like a nightmare of complexity. And so, if you bolt on to that in a naive way, oh, and there's also a neural net, and it's like God knows what it's doing. And there's these giant datasets, and there's thousands of neural nets, and you can... It really becomes insane. And so we came up with this, I would say interesting and constrained perspective on this, which is basically, let's forget, let's throw out everything we knew about our programming, and just pretend it's a Python script. And it just runs from top to bottom, just as you would write a Python script. And then everywhere you say numLayers equals five, you're allowed to say NumLayers equals slider. And so if you'll notice that at no point did you actually say that there was an event. At no point did you say, ""Oh, there's a state that gets modified in this way, when you get an event from this slider, you just said NumLayers equals slider."" And so that was kind of like, how do we get to that? And so we figured out how to get to that it implies some constraints on like what we can do in terms of the apps that we create. But it also like massively, massively simplifies the thought process that you have to go through to create an app. And the way we phrase the product now is like, ""Turn your scripts into apps."" Which usually when you think of creating an app, it's like create an app from scratch, or lay out all the widgets and then implement it. But if you just think of it as like turning your scripts into apps, then it's like a much more natural workflow. A lot of people didn't think StreamLit was that cool. And then they like, tried it and then like, within five minutes, they're like, ""Okay, it's super cool. I'm going to tweet about it or something."" That's certainly contributed to a lot of basically natural growth that's not mediated by marketing, or it's just kind of endemic or endogenous to the community itself. Yeah, but I have to say we watch this stuff quite closely at Weight & Biases, and it does seem like you have maybe the hottest program that data scientists and machine learning people use. So, congratulations. Yeah, well, I mean, that's really kind I have to say, it's been really fun. And when you're on the inside, you're always sort of focused on the worst case scenario? How could this go wrong? And how are we going to tell the employees if this doesn't land and all this kind of stuff? So it's really like nice to hear someone say that. So I'm curious, one of the things that's always driven me crazy about working with you is you always want to try some new programming language. And I'm always kind of like, ""Can't we use Python? Like a language that it's like, well documented that you actually know better than me?"" I'm kind of curious where you land on that now that you're doing a lot of Python. 
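To make the numLayers-equals-slider idea above concrete: in Streamlit the script really does just rerun from top to bottom whenever a widget changes, so turning a constant into a slider is the whole story. A minimal sketch follows; the model-building line is a stand-in for illustration, not anything from the conversation.

```python
import streamlit as st  # assumes the streamlit package is installed

# A plain script might hard-code the value:
# num_layers = 5

# Replacing the constant with a widget is all it takes to make the script
# interactive; Streamlit reruns the whole script top to bottom whenever the
# slider moves, so there is no event handling or state wiring to write.
num_layers = st.slider("Number of layers", min_value=1, max_value=10, value=5)

st.write("Building a model with", num_layers, "layers")
# ...the rest of the script would use num_layers exactly as before.
```

Run it with streamlit run app.py and the script becomes a small web app, which is the turn-your-scripts-into-apps framing described above.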
Like, do you aspire to create this this type of construct in other languages? Totally, totally. Our investors are going to like hate to hear this. In no way does this benefit the bottom line at all. But, I mean, actually, someone from the Haskell community tweeted, we should see what we can learn from StreamLit and I was like, ""That was the best compliment in the world."" Because those guys are hardcore dude. And actually had we written StreamLit Haskell, there's like all these cool optimizations we could have done because a lot more about the program in Haskell, basically I mean. And in Python, you know nothing going on. You can just like, literally change the direction of gravity in like one line of code. And I was watching your podcast, by the way, and I saw Jeremy Howard say that Python can't possibly be the future of machine learning, which I unfortunately don't agree with. I wish that were true. And he was, I guess, a big Julia proponent. And I do think actually, the key concepts in StreamLit actually are not specifically Pythonic, that thing that I was telling you about where you just sort of think of your programming script, and there's no events and stuff. I mean, you could write it in JavaScript, you could write it in Julia, and I just think it'd be super fun to do that. So hopefully, somebody will create enough profits that I can legitimately spend some time doing that. I think that'd be so fun, and if you want to with me in the woods. I guess that's the part you hate. That does sound like fun. I get to learn Julia. Or Elm. Have you considered Elm integration? Oh, my God, I love Elm. Don't get me started. What else do you dream of with the app? Like, do you sort of feel like the structure is done? Or are there sort of like big things that- No, no, no, there's like really big things that are missing, actually. We've actually been calling this fulfilling the open source promise, which is to say that the way that we allow you to build apps is like I, in some ways, like so new, that a lot of things that are kind of obvious how to do in a traditional framework don't carry it over to StreamLit necessarily. Basically, the way that works is that people do... StreamLit is great for some use cases. And then you can hit a brick wall, but you're like, ""Oh, and I also need to have persistent state that carries over from session to session. How do I do that?"" There's no way to do it. And yet, when we sold this thing to the world and we told them what we wanted it to be, we said, ""Hey, this is a general purpose app framework that's specialized for machine learning and data science. But you should feel confident in using in all these use cases."" And particularly now, I mean, StreamLit, it's part of the standard of data science workbench at Uber, and it's being used by a bunch of big, sophisticated companies, and people are really pushing on its limits in many directions at once. And we know that we lose people, because they're just like, ""I can't do this in StreamLit. I can't I can't go forward."" So we've set ourselves this task, which is called fulfilling the open source promise, which is basically take the big things that you can do in other frameworks that you can't do in StreamLit and just address them one by one. But not... I mean, we could actually do that in like five days, if we wanted to just like throw the can at it. But you want to do it elegantly you want to do it... I mean, that sounds very, like glib or cliché. 
But if nothing else, it's part of the fun to actually think about, like, what's the real way of doing this properly with regard to this thing. We released custom components, which is like a plugin system that allows you to take arbitrary React apps, or React or other kinds of web components, and plug them into your Streamlit app. So that like dramatically opens the sort of footprint of possible things you can do, because it's now like as big as React on some level, and React is sort of everything. And now we are adding way more visual customization, and notably, on October 15, we are releasing by far the biggest and most profound thing, which is single-click deployment of any Streamlit app. So the thing that you've been working on on your laptop, and the thing that you may have shared with the world laboriously by putting it on Heroku, or on GCP, or something, you can now literally push a button, and it's at a URL that everyone can see. Does that mean that you have to leave your laptop open for that URL to keep- Yeah, no, no, no, no, no. That's actually a cool product, too. And people have asked for that product. There was actually an executive at Twilio who became like, obsessed with Streamlit, and he was like, ""I just want to be able to send, in a Slack message, this app, like what I'm looking at on my screen, to my coworker. But I don't want to like put the entire thing on a third-party server, and I don't want to..."" And actually, there's a really cool... I love that idea. I keep trying to tell everyone, ""We should really do this."" I feel like people would just be like, ""I'll pay, I'll pay five bucks for that."" Let's reflect this app off of the Streamlit servers. Wait, it's so funny, because I always use this weird thing that Twilio makes where it can kind of reflect stuff off my server. So Twilio's sitting on the thing that I use for that purpose. I know. You should tell that executive, I forget what it's called. But it's like a bouncing thing in the cloud, where you can have a stable URL. Lukas, let's keep this between you and me and monetize it. Let's cut this part out of the... No, I'm kidding. Yeah, I think that would be a cool product. But the way that the sharing works is we instantiate your app. So actually, a lot of the work is taking your requirements.txt file and taking other kinds of your app requirements and stuff and building an environment that reproduces your app, and doing so in a way that's like sane and non-infuriating. But on the other hand, it's also taken a lot of work to build this thing. And one of the reasons why we feel confident building it is because so many people are building it themselves in like, a super ad hoc way. And actually, companies are building it themselves, too. And in many cases, just being like, ""Can you build this for us?"" So it's like we don't feel like we are... In a way we don't really feel like we're blazing a path at all, we just feel like we're sort of standardizing what everyone's doing. And then hopefully just making it like, way easier to do. So yeah, that's why we think it's a really cool feature. If it's public, it's free; if you want it to be private, it's a paid feature. And that's actually the next step for Streamlit. So we'll see how it goes. Nice. Congratulations. One thing I wanted to ask you about, I don't know if this is too far out of left field. But another thing that's been notable knowing you is how interested you are in meditation. 
And I was wondering if that connects to the work you do at all, in Streamlit. If your kind of, like, working life is connected to the, I guess, like the... Would you call it spiritual? Yeah, yeah, yeah, yeah. That's something you're interested in. What a delightful question. The funny thing about meditation is, it's impossible to tell whether it's goddamn working or not. So I can't answer your question. At the time when my projects had been canceled, and when I was forced to reckon with a definition of Adrien that didn't just involve like, creating cool projects that everyone loved one after another without... And I suddenly became very interested in meditation. And so maybe I was seeking a challenge that I could win. And no one can tell whether you're winning or not at meditation. Or maybe a salve for the pain in life. And it actually wasn't just my project being canceled, which is fine. But there were a number of like, personal things in my life as well, that were just really painful. And I think that I did something that everyone does, which is that I took what were legitimate problems, you might say, in my life, and I extrapolated from them more problems, and then I extrapolated more problems. And I essentially constructed like a prison in my mind. This is sort of kind of a Buddhist way of looking at things. And I think that it's like a very, very natural thing to do. And it actually happens constantly, like every second, and it's very harmful, basically, in the sense that, if nothing else, it's sort of taking you away from what's actually happening. I think discovering meditation showed me that you could dissolve those extrapolations. And in fact, that life wasn't quite as bad as it seemed, that my personal problems weren't as insoluble as I thought. And that, way more than Streamlit or anything professional, has altered the direction of my life. In essence, I do believe that meditation, and it's not the only thing, but meditation can help bring you like a little bit more in contact with reality. I also think probably one of the most important things you can do as a product designer slash really anythinger is be in contact with reality. And it's not as easy as you think, or at least as I thought, to do that, and therefore there might be a parallel there. But it's impossible to know. Have you continued to meditate through running Streamlit? Or has it become less relevant? Yeah. Well yes, it became less relevant. And also my life became worse again, I became super depressed when I started again, and now I'm feeling less depressed. And I also started taking antidepressants. Antidepressants are like a good 30 minutes of meditation a day, easy. And it only takes two seconds. For me, meditation looks like spending a little bit of time every day, just observing my mind construct, and then destroy, and construct and destroy, infinite problems and solutions and fantasies. And taking a little bit of time every day, and just remembering that that's, A, happening and, B, not actually connected with anything real. To me anyway, it's kind of a joy. And it's kind of like a good thing to remember. It's like remembering to enjoy everything else in life that's worth enjoying; it's just easy not to do and probably a good thing to do. That is a good sell for meditation. So we always end with two questions, I'm kind of curious how you'll interpret them. And the first one is, what is an underrated aspect of machine learning that you think people should pay more attention to? 
There's a lot of people who are very focused on this idea that we're all going to lose our jobs and the computers are going to make all the decisions. And I think that a much more plausible outcome for machine learning, as we understand it today, is just to massively increase our ability to like measure the world, basically. Not just have a security camera, but actually know how many people are walking by and how fast they're walking by, and whether they're men or women and cars, and all these kinds of things. And understand what appliances are plugged into your wall, and all this kind of thing. So I think that like, in essence, looking back on this time, we're going to feel like 2019, 2018, we're going to hit the informational bedrock, like we didn't know anything that was going on in the world before 2018, relative to the future. And I think that that perspective, which is that it's like we're opening our eyes, and seeing what's happening in the world, at a totally new level of resolution is actually going to be a much more apt description of what the machine learning revolution brings. Interesting. All right. Interesting answer. And the final question is simple. It's basically what's the biggest challenges that make it hard to take machine learning models and deploy them in the real world? I think every machine learning tools entrepreneur will tell you that it's whatever their company is doing. And I think that's a totally legitimate answer, by the way. So I suppose you'll tell me it is experiment tracking and hyper parameter search and- How would you answer that? And I think it's legitimate, you're clearly solving a huge pain point for people. What is that piece that requires Streamlit? Yeah. So I saw this at Google, at Google X. I saw this at Zoox, at [inaudible 00:47:37]. It's the machine learning teams and the data science teams are actually the gatekeeper to this really fascinating and exotic storehouse of stuff. Like data sets and models, and predictions of the future. And that is actually very difficult for other people to get at. The state of the art is often like, I'm going to send you an email and just do a one off exploration in Jupyter notebook and tell me the answer and paste it into a PowerPoint presentation. Like, that's a lot of how the rest of the company interacts with the data science team and the machine learning team. And that's kind of insane. It's so inefficient. And so I think that the aspiration that I have for StreamLit is that almost as a byproduct of existing workflows, the engineers working on those teams are empowered to sort of bring their work directly. Inject it into the entire company and allow the whole company to make decisions and predictions and stuff in the same way that they can. I think it would have a big, big, big impact and it already is starting to. Yeah, awesome. Well, thanks, Adrien. It's great to talk to you. Well it's really fun. When we first started making these videos, we didn't know if anyone would be interested or want to see them, but we made them for fun. And we started off and making videos that would teach people and now we get these great interviews with real industry practitioners. And I love making this available to the whole world so everyone can watch these things for free. The more feedback you give us, the better stuff we can produce. So please subscribe, leave a comment engage with us. 
We really appreciate it.",8239 +Peter Norvig – Singularity Is in the Eye of the Beholder,https://www.youtube.com/watch?v=hVW1mwLtDcI,2831,2020-11-20,"So one thing is singularity is in the eye of the beholder. Sure. So if you're Kurzweil all the curves are exponential and they're going up. And right now is a special time. But if you've got log paper, then all the lines are straight lines and there's nothing special about right now. It was a straight line yesterday and it'll be a straight line tomorrow. And so I guess that's more of my philosophical point of view. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. I've admired Peter Norvig for a long time. He's a director of research at Google and before that, he directed Google's core search algorithms group. He also wrote Artificial Intelligence: A Modern Approach, which is probably the most famous textbook in artificial intelligence and the one that I used when I was first getting into the field. Prior to his work at Google, Norvig was NASA's chief computer scientist. I could not be more excited to talk to him today. Peter, thanks so much for taking the time to do this. You keep coming up in my life. I love doing Project Euler and you have an incredibly useful set of Python libraries, like little Python libraries that really help with the Project Euler challenges. I was wondering how many of the Project Euler problems that you've done and if you have a favorite one or one that was memorable to you? Yeah, I guess I lost count of how many I've done because they had a problem once where they had a security breach and then they changed all the passwords or something. And if you didn't get the message then, within a month or two, then your account got locked out. So that happened to me. And so I lost sort of my main account. And then every now and then I did one or two, but I lost contact with that. So I've been doing other things like Advent Of Code and so on, but my utilities are still out there. I might sometime actually publish my answers because originally they said they didn't want anybody publishing answers, but I think they've given up on that because they realized that they're all out there anyways. So maybe I'll go back and clean those up a little bit and publish them someday. I guess I had a lot of fun with the Monopoly simulation just because I could get it down to a half a page. And it seems at first that the rules of Monopoly are really complicated, but if you're only worrying about the part about the question was asking what's the probability distribution over the squares. If you're only worried about that part, you can simplify it a lot. So I remember that one as being fun. I remember doing that one. Do you have a favorite, in Advent Of Code and your pytudes repository, do you have a favorite piece of work there? I guess in the pytudes, I like a couple of different ones on probability. I think I have like three different notebooks there. And one of the things I think is really interesting is I've changed the way I think about probability, or at least the way we teach it. 
And one of the things that really struck me is I went through and I said, ""Here's a bunch of problems, the typical kinds of things where you're dealing cards or you're picking colored urns out of a bowl or marbles out of a bowl or whatever."" And then they came to one where in the textbook, they said, ""Okay, now this one's going to be completely different because this is a Bayesian one where you have to reason backwards."" And you show the math, and there's like 10 lines of equations and you get the answer. And I was going through and I said, ""I don't have to do anything different. I can solve this in exactly the same way I solved all the other ones. And there's nothing special about it being Bayesian."" And the reason they said it was special was look, you can solve it in only 10 lines of math using this formula. And if you didn't have that formula, you'd have to consider a million different cases. And that would be completely infeasible. But I looked at it and I said, ""Well, this is so simple. There's only a million different cases. So why do anything different? Why not just think of it as exactly the same as all the other questions? And only this time the computer has to enumerate over a million values rather than over a hundred values or whatever."" So I thought that was interesting. It changed the way you think about how you solve these problems. And all of a sudden, problems that look like they were different are actually exactly the same. Just go ahead and enumerate them. As long as they're discrete problems, you can almost always do that. Reminds me of that problem, have you heard of the problem that apparently it didn't fool John Von Neumann, where the dog runs back and forth between the two people that walk together? And he just summed the infinite series... So he's got a gigahertz computer in his head, so he could solve it a different way than everybody else. Maybe we should call that the problem that didn't fool Peter Norvig. So one thing I really was curious about is I saw the new version of your famous textbook came out this year, right? Yeah. The Artificial Intelligence"" A Modern Approach. I was wondering, I feel like one thing that everyone talks about is how hard it is to stay current on all the kind of topics happening with the vast amount of research. And I'm wondering what the process was like for you to even just picking the topics that felt relevant for a textbook. Yeah. So it's really hard, both in terms of staying up to date, because things are moving so quickly, both because just it's an exciting time and things are happening and also because the process is a lot faster now, because previous editions, we only had to read the journal articles and the conference proceedings. And now you have to look at archive every day. Right. So things move a lot faster. I remember I complained to David Silver of Deep Mind and AlphaGo. I said, ""Here's the third time you made me rewrite the chapter on games and I better finish the book quick or else you're going to do it a fourth time."" Were there any topics that you kind of regretted not being able to include in the book? I guess I regretted not having better answers. I think things, they're still in a state of flux. So we added a chapter on probabilistic programming, but I liked to have had a better advice for how that fits in with other types of machine learning. And when do you take a model based approach and when do you take a data based approach? I think we don't quite understand that trade-off yet. 
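As a rough illustration of the enumerate-everything approach Norvig describes above, here is a small Python sketch in that spirit. The urn setup and the numbers are invented for illustration; this is not an example from the book. The point is that reasoning backwards is just filtering and counting equally likely cases, with no explicit Bayes formula.

```python
from fractions import Fraction

# Hypothetical two-urn problem: urn A holds 3 red and 1 blue marble, urn B
# holds 1 red and 3 blue. Pick an urn at random, then a marble at random.
# Given that the marble is red, what is the probability it came from urn A?
urns = {'A': ['red', 'red', 'red', 'blue'],
        'B': ['red', 'blue', 'blue', 'blue']}

# Enumerate every (urn, marble) outcome; with this setup each is equally likely.
outcomes = [(urn, marble) for urn, marbles in urns.items() for marble in marbles]

# Conditioning on the evidence is just keeping the consistent outcomes
# and counting.
red = [(urn, marble) for urn, marble in outcomes if marble == 'red']
p_a_given_red = Fraction(sum(1 for urn, _ in red if urn == 'A'), len(red))
print(p_a_given_red)  # 3/4, the same answer the Bayes formula gives
```

With a computer doing the counting, a problem with a million cases is handled the same way as one with a hundred.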
We kept a lot of the old material, even though we know people are skipping it. So professors aren't teaching resolution theorem proving anymore, but we kept that in. We cut down that material quite a bit, but it's still there. And I think the idea of understanding how representation works is important. And I think that's really what deep learning does is it invents representations and it's able to build on those because it is deep and has multiple layers. And it's just that we don't quite understand what those representations are. I think that's important and it's important to see what the possibilities are. So I think those chapters should be there, but I wish we had a better take on the story of how they all fit together. Now it feels like here's a bunch of different technologies. You should know about them, but it's up to you on how to integrate them. What percentage of the book at this point is what we would call machine learning? Probably a third of the book is sort of officially about machine learning concepts, but then all the application areas are very heavily into machine learning. So we have natural language. We have computer vision. We have robotics. And all of those are very much dominated by machine learning. What parts changed the most since ... It was like 2009 that the last version came out. Where were the biggest changes? So all the deep learning stuff is new. And then I think the philosophy part is new. And also I guess the practical parts. I mean, we didn't really try to say this is everything you need to know to be an engineer, but we try to give some hints of what's practical. So all of these issues around privacy and security and fairness and how to do testing within the machine learning model, that was new. The philosophy part, we had a philosophy chapter before, but it was mostly like Searle's Chinese room and things like that. And now it's much more focused- Yeah, I remember that. Yeah. Now it's much more focused on unfairness, autonomous weapons and these much more important issues. Interesting. One thing I always wonder about with the AI ethics stuff is it seems like a lot of it would generally apply to all technology. Any technology in a weapon seems scary and fairness seems like always a problem. I mean, how much do you feel like these issues are AI specific? Yeah, I think that's exactly right. In fact, I was involved once with yet another one of these efforts to lay down an ultimatum and a set of AI principles. And I did work on the Google set and I think we did a pretty good job of that, but I was involved in another group doing that. And at one point I just said, ""I don't think we need this. I think these things have already been said elsewhere. And I think, as you say, a lot of them are general principles for engineers and some of them are general principles for anybody living in a society. I think we need less of sort of AI and machine learning specific principles and we need more at either a higher or a lower level. Right? So we need it for engineers in general, and there are some codes of that for engineers. And then we need much more specific ones because if you try to do it at the machine learning level, then you have these very vague principles of what's privacy and what isn't. 
I would rather say, ""Okay, well now we're going to have it at the machine visions level and we're going to have a set of principles for them."" And then you can start asking not just what privacy is good, but you should say, ""Well, what exactly can I do with face recognition? And what are the limits there?"" And then it's easier to formulate that if you're at a much more specific level. That makes sense. Another thing that you wrote that I really loved was your unreasonable effectiveness of data paper quite a long time ago, which was one of the inspirations for me to start my last company. And then I saw Google came out with, 10 years later, basically it sort of showed that it still holds up. Still even more data seems to be even more effective. I'm curious if you feel like anything that's happened with data since then has been surprising or there's anything from those observations back then that yo would actually take back? Because when I reread it, it feels so completely true, like almost shockingly true, I guess 15 years later. I don't know if I would take it back, but I think it promotes one point of view that I think was important at that time. But there's another point of view that says that data isn't everything. And I think that needs to be said as well. I mean, everything has to be balanced out. Some of that has to do with these things we were talking about with the privacy and security and fairness and so on. We still found places at Google where more data helps. We've also found places where data is a liability rather than, or maybe as well as, an asset. Interesting. What's a place where a data feels like a liability? Do you have any stories around that? We're doing a lot of effort in this federated learning where we say, ""Well, maybe we don't want to hold on to your data. Maybe that's too sensitive."" Particularly for things like speech recognition, where we do want to build a model that works better for you personally, but we don't want to have your private conversations in our data center because that's a liability. If we screw up and reveal them, that's a bad thing. So we'd rather build the models on your device. So we'd rather build the models on your device and then have a way of sharing the parameters of that model without revealing any of the underlying data. And so a lot of work has gone into those kinds of approaches. Although I guess that's still a case where data is useful. It's just trying to sort of use the data in a comfortable way. Yeah. It's still useful, but it's trying to figure out the best way to be safe and willing to compromise some in order to achieve that. I guess, what else would you say under the theme of data isn't everything? What other points would you make to argue that side? Well, I guess another thing people are concerned with power and now, especially on your small devices. And so you could say, ""Well, the best thing is this huge model with billions of parameters, but now I want it to run on the phone. So now I'm going to build something that throws away 99% of the data, but hopefully performs 95% as good."" And so a lot of work is into that kind of approach. It's funny. I guess I think of you as kind of a tinkerer. I don't know if that's right or wrong, but I guess, for me, as maybe a fellow tinkerer, it seems a little sad to me that a lot of the AI breakthroughs have really been through applying massive compute. 
I mean, maybe that's just a fact of life, but it does seem like it might really inhibit research if you need millions of dollars to build a breakthrough model. Yeah. I think that's definitely an issue. And certainly we've seen that. These GPT models take a long time to train and a lot of computational power. So both in terms of the expense that it takes to do that and just the availability of who can work on it, those are definitely issues. I think that as we get better at transfer learning, that some of those issues will go away, and that the typical way to do things will be to say, ""We'll start with this model that's already been built, and then we'll modify it a bit."" And so the expense for doing that should be a lot less. And certainly there are possibilities now as a lot of the cloud providers are offering credits for researchers and so on. So some people can get through and get access to that kind of power. But of course, not everybody can, and you've got to have some way to prove yourself worthy. And I don't know if that selection process is always fair. Right. I mean, I guess that leads me to another question down my list, which I'm sure a lot of our audience is going to wonder about, which is, just because you've been so successful in your career and your career has lasted so long, do you have advice for a young researcher maybe starting a PhD program or coming right out of undergrad? What would you guide someone to work on, or what would you work on if you were in that position today? I guess I'd probably try to understand biology better and work on that because I think there's a lot of opportunity there. There's a lot of data. It's important for the current COVID situations. That's obviously one big application. All aspects of health are important. So for me personally, I'd probably do that. But my advice would be find some area that you're interested in and concentrate on that. We've had a couple of different biologists that are different lenses come and talk on this. Is there any new research in biology that you're seeing that's particularly exciting or predictions you have about biology and machine learning? There's lots of different areas. Understanding human health and personalization, I think, is important, and I think we're just starting to do that. Understanding the genome protein folding and drug discovery and understanding how neurons work, I think, is important. Recently, we've seen a couple of cases of people that have published connectomes of various organisms and so on. So we're just starting to be able to see maps of that, and we're starting to get better tools to understand that. When you look at deep learning, it sort of feels like that came suddenly, but a lot of those techniques were around, in fact in your book, I remember quite far back. Do you think that the field missed something, or was it just not possible to run at the scale necessary to show that these neural network techniques were working better than people expected in the early aughts? Yeah. I mean, if you say suddenly, right, we've got a sudden leap in computer vision and image net after Hinton had been trying the same thing for 30 years, right? Right. And then it finally worked. And I think the biggest difference was the computing power. Definitely there were advances in data. So we could do image net because Fei-Fei Li and others gathered this large database, and that was really important. There are certainly differences in the algorithm, right? 
We've got a slightly different squashing function. Instead of being shaped like this, it's shaped like this. I mean, I don't know how big a deal that was, but we learned how to do stochastic gradient descent a little bit better. We figured out that dropout gave you a little bit better robustness. So there were small things, but I think probably the biggest was the computing power. And I mean, I certainly remember Geoff Hinton came to Berkeley when I was a grad student in 1981, I think, when he talked about these neural nets. And my fellow grad students and I thought that was so cool. So we said, ""Let's go back into the lab and implement it."" And of course, there was absolutely nothing you could download, so we had to build it all from scratch. And we got it to do exclusive or, and then we got it to do something a little bit more complicated. And it was exciting. And then we gave it the first real problem, and it ran overnight, and it didn't converge, and we let it run one more day, and it still didn't converge. And then we gave up, and we went back to our sort of knowledge-based systems approach. But if we had the computing power of today, it probably would have converged after five seconds. So I remember Daphne Koller telling me, maybe 2003, that the kind of state-of-the-art handwriting systems were neural nets, but that it was such an ad hoc kind of system that we shouldn't focus on it. And I wonder if maybe I should have paid more attention to that and tried harder to make neural nets work for the applications I was doing. Yeah, me too. And certainly Yann LeCun had success with the digit database, and I think that was over-engineered in that they looked at exactly the features they needed for that set of digitizations of those digits. And in fact, I remember researchers talking about, ""Well, what change are we going to do for sample number 347?"" Right? Oh, really? Okay. There were individual data points that they would form theories on, so that was definitely over-tuning to the data. And it should have been an indication that it was a good approach. It was better than other approaches at the time. I guess so. Although that does sound like a damning level of over-fitting the data, I suppose. Right. There were only a couple thousand data points. I forget exactly how many. Maybe it was 10,000. Maybe it was even 100,000, but it wasn't many. I guess more broadly, when you think about what you were thinking about at the beginning of your career and imagining into the future, what's been surprising in the development of artificial intelligence? I guess when I started, I did it because I thought it was really interesting, and it was an academic approach. And I guess I was surprised at how much it's had an impact on everybody's everyday life. That wasn't something I was expecting. I mean, I knew it was probably a more practical field than 13th-century Italian poetry, and I figured my salary is probably going to be higher going into this field. But I still thought of it as an academic challenge that was obscure and not as something that would touch everybody's life every day. I guess, have there been any approaches that you thought wouldn't work but then worked better than you expected? Yeah, I guess, in general, people are surprised that these deep-learning approaches work as well as they do and in as wide a variety as they do. And I grew up at a time when there was a real emphasis on saying, ""We need to understand representations and inference,"" and focused on that. And I think that's still true. 
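For readers who want to picture the ""slightly different squashing function"" comment, here is a small illustrative comparison of the older sigmoid nonlinearity and the ReLU that later became standard, including their gradients; the shrinking sigmoid gradient is one commonly cited reason deeper networks train more easily with ReLU. The values printed are just example inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)            # at most 0.25, and tiny for large |z|

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # exactly 1 wherever the unit is active

z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(z), sigmoid_grad(z))
print(relu(z), relu_grad(z))
# Gradients through many sigmoid layers shrink multiplicatively (vanishing gradients);
# ReLU keeps the gradient at 1 for active units, which helps deep networks train.
```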
That's still important. But I think we learned a couple of things. One is that you can do more with just the pattern recognition. And maybe we were exhibiting some speciesism of saying, ""We're humans, and we do a lot of this higher-level reasoning. So maybe that's the really important thing."" But there's lots of other animals that live long lives and do a lot of cool stuff, but without having a lot of that higher-level reasoning and long-term planning, and they can do short-term plans. That's a good point. And they're not thinking about that in the same way we are. So I think we kind of missed that, and my hope for the future is that we can bring those back together. So I think it is a good idea to be able to do reasoning, to form representations, to simulate into the future and choose courses of action. I think where we went wrong is that we were so seduced by, like, first-order logic, in saying, ""Oh, it's got such a cool theory of inference."" But the problem is once you get outside of mathematics, this idea of kind of fixed predicates that are either true or false just doesn't hold up, right? So yeah, we can define what a triangle is and say, ""If your two sides are equal, then your opposite angles are equal."" And we can reason through that. But once we get into, okay, now you're driving a car, and we say, ""If there is a pedestrian on the sidewalk, then what?"" Well, first of all, we don't know for sure it's a pedestrian. All we've got is a point cloud. And secondly, every sidewalk is different. All the predicates are vague, and all of the situations are unique enough that this kind of if-A-then-B reasoning falls down. So I'd like to get back to something where we combine this, ""I'm going to do pattern recognition, I've seen something like this before, what's similar?"" But also some ability to say ""Yes, and in addition to all these neuron weights that I'm seeing, I can also extract something that's abstract. And I can reason forward a little bit, as long as I don't take it too seriously."" Right? Do you think that the reasoning needs to be something that is kind of understandable by a human? I think that helps debugging, but I don't think it's necessary. So there are a couple of problems. One is we trust it more if we can understand it. It enables us to debug it and enables us to take the advice more seriously if it's talking our language. But there's no reason it should because computers have different powers, so they should think different ways, right? And I'm sure a bee has a different representation of the world than I do because its visual system is so different, and we shouldn't try to have the same approach. On the other hand, it's certainly possible that if we don't understand what they're doing, that they'll solve the problem in the wrong way, right? And so I saw something yesterday. They were trying to distinguish between husky dogs and wolves and trying to figure out what the salient features were. And they decoded it and said, ""Well, one of the most salient features was whether there was snow or not."" Right. Right? And that's good if the only thing you're trying to do is maximize your results on this particular dataset, but it hasn't really helped you solve the real problem. And so I think we have to be wary of those kinds of accidental coincidences that our machine learning systems are very good at picking up on. And I guess part of that is having a better theory of how the future is going to be different than the training data, right? 
We can easily imagine, okay, here's all these pet dogs that are inside houses. Could they be outside in the snow? Well, sure, of course they could. But our machine learning systems don't pick up on that. It does seem like though, and when you think about even AlphaGo or Alpha Chess, that they're successful enough that it feels like they must be building some kind of higher-level representation within the models that they're building. Do you think it's possible that you could take the types of algorithms we have now and make them bigger and add more data and they'll sort of build higher-level representations that make them functionally similar to human-like intelligence, or do you think there's some real change or different methods needed? So that's a good question. Part of it is I don't think we really understand how powerful it is to have a perfect memory and have gigahertz-level reasoning capabilities. And I think you can do a lot with it that we thought would take much more complex reasoning, but it doesn't. And so I think that comes up a lot. I remember a very good Go player saying, ""I can't beat AlphaGo even with the search turned off."" So- The search turned off? Yeah, no look-ahead. It doesn't look ahead, but it just has one network that says, out of all the possible moves, which is the best. And he can't beat it even with only that turned on. So it's doing something there, right? It's not just that it's very good at searching and figuring out where to search, it's that it has some abstract representation of what a good move is or isn't. But nobody quite understands what that representation is, both when it's playing alone and then also when it's combining it with the forward search. So there's something definitely going on there and we're not quite sure what it is. I think there's been some interesting hybrid approaches. So one of the things I think is interesting is theorem proving, which is one of the few places where logical reasoning actually works. But if you talk to a mathematician, it's this combination of following the rules of inference and then some intuition. And there's some work on trying to combine that. So Christian Szegedy and Sarah Loos have this system where you take sort of a regular theorem prover and you give it a problem. And then you have a neural net decide, out of the million axioms I have, which 100 are most relevant to this problem. And then you feed those axioms to that theorem prover and now it's able to brute-force the proof. Whereas if you gave it all the axioms, it would get lost and it wouldn't be able to find it. So I think that's a nice way of saying mathematicians have two things. They have the power to correctly follow the rules, and then they also have intuition of, I think this is the way it's going to go. So I'd like to see more of that kind of approach, where you have these very powerful general techniques that you can call on but then on top of that, you try to learn the patterns for how to use them. Another example I think about is we have things like mathematical induction where if it's true for one, and it being true for N means it's true for N plus one, then it's true for infinity. And that's great in math, but in the real world it doesn't work. And we have these paradoxes like, well, I have a mountain and then I take away a grain of sand, is it still a mountain? Yes. Well, what if I do that an infinite number of times, then it's no longer a mountain. When does it stop being a mountain, right? 
So we don't quite have answers to that. The way I would approach problems like that is say, well, you got to learn two things. One, you got to learn this rule of, I can take away a single grain of sand and then you have to learn the applicability of saying, well, for sand, I can't do that too often, but for integers, it's fine to do it an infinite number of times. And we as people, we figured that out, but we don't have good ways of saying that to our computers. I think it'd be interesting if we could figure out how to say that or to teach them that. Interesting. You imagine teaching computers sort of facts at large scale? So here, I guess, I'm talking more about control strategies or applicability effects. I guess you touched on this to some extent, but another question I have just because you've been doing this for longer than most, when you look at the applications, it seems like some applications have turned out to be much easier than others, and it's been pretty surprising. Do you have a sense of things that actually surprised you? Because you've actually, I think, been very good at predicting the difficulty of different problems to machine learning. Have there been any applications that have been surprisingly harder, surprisingly easy throughout your career? Well, we still haven't quite figured out the self-driving cars and I'm not sure how surprising that is. And certainly people made predictions that we'd have it by now. I guess I'm not that surprised just because I think it's so complicated and there's so many different possibilities and that the stakes for going wrong are so high. One thing that's surprising to me, slipping away a little bit from your question, if you had asked me 10 years ago, would it be a good idea to voluntarily give up the keyboard and the screen attached to your device and just have a speaker sitting on the shelf that you can talk to? I would say, ""No. That's dumb. Why would I want to give up all these good input and output device and just to have that? That's crazy."" But some people like that for a lot of things. And so I thought that's interesting. And why did I make that prediction wrong? And maybe it's because I'm too tied to the devices and would prefer to not be as tied to them. I still think we have a long way to go with these assistants that are getting pretty good at recognizing your voice. And I can tell it to play a song and I can ask it for the weather report. But then there's 10 more things I can do. I can ask it for a recipe and so on. But after a dozen or so, things, now I'm stuck and now I'm not quite sure if my next query is going to work or not. And I like sort of the security of you open a new desktop app that you haven't learned before, but you can poke through the menus and you can get a good idea of what you can do and what you can't do. But with these speech based assistants, you have no idea what's going to work and what's not going to work. And so I think that's interesting. And so either we have to have a better theory of how we teach people what they can do, or we have to fulfill this promise of, well, it's just like talking to a person and you can say anything. And we haven't fulfilled that promise yet and we haven't given people a good model of what works and what doesn't work. So I think that's a real challenge. One fact that seems remarkable to me even though it's so clear and we see it all the time is how little machine learning seems to work in the real world with robotics, right? 
I think it's incredible that computers can beat the best person at Go, but might have trouble picking all the Go stones off the board every single time. I think we're getting better at robotics. I guess it was... That's a good example of something that was harder than we thought. And I remember Marvin Minsky saying, ""Oh, it's a waste of time working on robotics because this trivial little stuff is so hard you'll never make any progress. If you want to get a PhD, do it in simulation rather than in robotics, or else you'll never graduate."" And I think maybe that was good advice for people wanting to get a PhD but I think it was bad advice for the field as a whole, because I think these problems were hard and they're important, not some trivial thing off to the side. And I guess if you applied that advice everywhere, you'd never do anything new, right? Synthetic data is something that seems really interesting and promising. Is that something that you covered in your new book? A little bit, yeah. And I think that's important and certainly very important in robotics. And I guess computer vision in particular is the easiest place to come up with synthetic data because we really understand how optics work pretty well. And if you want to say, take this image and rotate it or put it under different lighting conditions and so on, we know how to do that because we understand the physics. Other kinds of data, we don't necessarily know how to do that, right? So we make synthetic data by, let's make some random changes and hope they're not too bad. But if you don't have a strict physics model, you can fool yourself to some degree. But I think that's important and we've done that a lot. In robotics, I think the sim-to-real transition is really important, right? Because a lot of times you're limited in doing things in real time, but simulations can run much faster. And now you'd have to make sure that the simulations are calibrated and transfer into real life. And I think for the most part, we've had pretty good success with that. And some of that takes a long time, right? So you got to have the vision models, you got to have the physics models. And I'd say part of the problem with self-driving cars is the first billion miles are always the hardest because you can't build a good simulator until you've been on the road and seen 1000 new really weird things that you never would've thought of putting into your simulator. And once you have that, then progress is going to be 100 times faster if you can run in simulation rather than having to run on the real road. Right. Right. What about then language? We were talking to Anthony from Kaggle on this same show a month or two ago. One of the things he told me that just really surprised me was that the winning Kaggle strategy in some of the language tasks is to use Google Translate to just take the sentence or document, translate it into some foreign language, translate it back into English. And so you get some natural changes, but maybe the underlying semantics is somewhat preserved. And I guess that really works as a synthetic data generation strategy. Yeah. Yeah. Yeah. So that's interesting. And I guess people also just break language up into pieces. I think that's an area where transfer learning has worked pretty well. Where of course, probably even more than vision, we've got lots and lots of language available, so it's easy to find stuff to train on. 
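The round-trip-translation trick mentioned above is usually called back-translation, and it is easy to sketch. The `translate` function below is a hypothetical stand-in for whatever translation API or model you actually have access to; only the shape of the augmentation loop is the point.

```python
from typing import Callable, List, Sequence

def back_translate(sentence: str,
                   translate: Callable[[str, str, str], str],
                   pivot_langs: Sequence[str] = ("de", "fr", "ja")) -> List[str]:
    """Generate paraphrase-like variants by translating to a pivot language and back.

    `translate(text, src, dst)` is a hypothetical stand-in for a real
    translation service or model; swap in whatever you actually use.
    """
    variants = []
    for lang in pivot_langs:
        pivot = translate(sentence, "en", lang)
        round_trip = translate(pivot, lang, "en")
        if round_trip.strip().lower() != sentence.strip().lower():
            variants.append(round_trip)  # keep only variants that actually changed
    return variants

# Usage sketch: augment a labeled dataset while keeping the original label.
# augmented = [(variant, label)
#              for text, label in train_data
#              for variant in back_translate(text, translate=my_translate_fn)]
```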
But one, it's not on the topic area that you're dealing with, but we've found, in a lot of cases, it still helps a lot to have that. So that's really useful. And then we also found, I guess this was really surprising to me, that transfer across tasks worked really well. So you can train on question answering and then that helps you do a summarization for going across different tasks. And I guess that was a little surprising to me. Okay. I feel a little self-conscious asking this question, but I need to ask it. I mean, what are your thoughts on singularity? Do you believe in some form of that? AI is going to be better than humans at all tasks and then continue to improve? Is that something you imagine happening? Yeah, not really. So one thing is singularity is in the eye of the beholder. Sure. So if you're Kurzweil, all the curves are exponential and they're going up and right now is a special time. But if you've got log paper, then all the lines are straight lines and there's nothing special about right now. It was a straight line yesterday and it'll be a straight line tomorrow. And so I guess that's more of my philosophical point of view. That things will get better but... Actually I did talk at one of the singularity conferences and I tried to answer that question of, is this a special time right now? - answer that question of is this a special time right now? And the way I did that was linguistic research, is I did a search over past machine learning papers and broke it up into decades and I searched for a couple of key terms like, ""Unlike other systems, our system ... "" And so I found all of the sentences that were like that, and then by hand, I sort of made a histogram of what was the breakthrough, and the answer was there wasn't anything special about right now, and lots of the same breakthroughs were being made 20 years ago. Somebody said, ""Like other systems, our system does X,"" and today, some of the same things, the same X. The same X, the same value of X. What was the common value of X? They were all over the map. Interesting. So my answer was we still don't know what we're doing, and I think the other thing is I think people talk about these hard takeoffs and soft takeoffs and so on, and I think everything's going to be gradual and we're just going to get used to it. I mean look at the changes we've had already. So now everybody walks around with a device that has access to all the information in the world, and that seems like that should be a huge thing that's really different, and yet mostly we say, ""Yeah, well what's the big deal? Of course, everybody has that."" So I think that's going to happen in the future. So there'll be robots you can talk to and can have real conversations and they can do things for you and people will just say, ""Well, yeah. It's just another thing I have. I have my phone, now I have my robot. It doesn't really change my world that much."" So just to ask it more concretely, do you expect a world where that robot is smarter than you in every way? No, I don't think so, and - Because even with a straight line on log paper eventually. Yeah. Yeah. I mean I could see that argument eventually, but I'm older than you. So I don't have to predict out as many years into the future. I'm very confused about my predictions. On the one hand I say I don't think there's going to be these big changes coming, and so if I had to bet, I would bet against people like [inaudible 00:38:25]. 
On the other hand, I look at past predictions and I'd say, well, [inaudible 00:38:27] has probably done a better job than me, so I should probably bet for him. And I haven't quite been able to figure out that contradiction. Interesting. All right. Well, here's another kind of a little bit off the wall topic, but it's been surprisingly interesting with various guests. Do you think that Python will continue to be the kind of main programming language of ML for the next decades? Or do you think something else is going to come along and unseat it? Yeah, I don't know. I've been doing Python for a while and I came to it actually because of the textbook. So we had the textbook, it's got pseudocode in the book because we didn't want to tell a professor what language to use. When we did the first edition in 1995, we implemented all the pseudocode in Lisp because that was the style in AI at the time. Then over the years Lisp started to fall out of favor and students would complain, ""We don't understand how this code works. What are all these weird parentheses doing there?"" So I said, okay, I got to reimplement all the code from the book in some other language, and I said, ""Well, what's the most popular language? Java, I'll do that."" So then I said, ""Oh, well this is such a mess, that I can't take the pseudocode and implement it directly. I can't just have let's create an object X. First I need an X factory and then just ..."" Right, right. It just got complicated. It wasn't a good match to the pseudocode we had written. Yeah. Instead of saying what's the most popular language, I said, ""What's a pretty popular language that's the best match to the pseudocode?"" And I didn't know that much about Python, but I looked at it and just said I must have been cheating and channeling Guido when I came up with my pseudocode, because my pseudocode was almost exactly Python. So I said, ""This is going to be the easiest thing to do, so that's what I'll do,"" and it turned out that was a good choice, because the popularity of Python really started to grow, and I think that's important. I think there are some limitations. So looking at where we are today, I guess I would be happier if Julia were the main language. Python's starting to have type declarations now, but they don't quite take them seriously. Julia does a much better job of that and Julia was written to be more efficient sort of from the start. So I think that's probably a better choice. I guess some people are using Swift. I don't know too much about that, and there are other languages, but I think the popularity is going to be more important than the difference between them. And if the language is popular, people will put in what's necessary to do it. So you look at JavaScript, it was a very rushed language design, so I don't really blame the designers for that, but there's a lot of weird stuff in it, and yet, because it was the only thing that you could run in the browser, people ended up coming up with really good compilers for it because they had to, and I think that hasn't quite happened with Python yet. I'm a little bit surprised at that. I guess probably the reason is because we didn't have to. 
So right now the Python compilers aren't the greatest and I think that's not necessarily because of the language design, because I don't see anything that's that much different between Python and JavaScript, but it's just that it was necessary to have a fast compiler for JavaScript and it's not necessary to have one for Python, because in the browser you have no choice, but outside of the browser you could have used D+ or Rust or something else. So it's not as necessary that Python becomes as fast. And we may end up having splits with things implemented in different languages. As long as they interface with each other, that's probably okay. But you still write most of your code in Python, right? Yeah. Yeah, I do, in part because a lot of what I do is teaching focused and Python is good, one, because it's what's taught in a lot of the schools, and secondly, as I was talking about before, there isn't a lot of [inaudible 00:06:26]. So if you've just trying to say, ""Here, I'm trying to show an algorithm,"" Python is very good for being a direct implementation of that algorithm without a lot of other stuff you have to worry about. If I was worried about efficiency, I'd probably be using something else. Interesting. That's a really interesting reason to pick it because it looked like your pseudocode. I can totally picture that. Yeah. That is a great thing about Python. We always end with kind of two a little bit open-ended questions. Feel free to take them where you want, but one is is there a topic in AI or machine learning that you wish people would pay more attention to? I guess I would pay more attention to all these peripheral issues of fairness and equity and privacy and security and operations. So we have this term of MLOps now, and I think that's good, but I think people should pay more attention to the whole life cycle of the product rather than just say, ""I'm trying to get the highest possible score on my test set."" Well that's an incredible segue into my final question, which is when you look at deploying machine learning in the real world ... and I guess this is MLOps, where do you see the biggest bottlenecks or challenges or problems? Yeah, so there's a lot of them. I guess one of the biggest ones I face continuously is drift. So the data changes, the users needs changed, and you have to have some way of monitoring that and responding to it. And I think we've had 50 years or so of making better tools for software engineering and so we're much better at that now. It's harder to insert a bug into a program that it used to be, but we don't have much of that for machine learning systems. They only have a couple of years, and you've been trying to contribute to that and that's great. And we have some tool sets, but I think we're still far behind, and so we run into these problems. One of the things I see continuously at Google ... and we're a big place, so it's easy for a team to say, ""I need some data that does X. Oh look, here's another team that's producing that data asset. Let's plug it in and try. Look, it works great. My problem is solved."" And then six months down the line, things are just slowly getting a little bit worse every day and nobody knows why, and eventually someone figures out this other team that, for a while the two of were on the same path and they were producing data that worked for them and worked for us, all of a sudden they veered off in one direction and we veered off in another direction. 
They made small changes a little bit at a time to the data they were producing, and for them, those changes were updates, and for us it made it worse, and we found we don't have good ways of tracking all that. So sometimes the team didn't even know somebody else was using their data. So they didn't pull and say, ""Hey, is it okay if we make this change?"" They just said, ""It's good for us. We're going to go ahead and do it,"" but it hurt someone else. And there's all sorts of ways in which the world changes and drifts, and I think we built a software engineering approach where we say you make a change, you get it reviewed, you run all the unit tests, you check it in, and these changes are relatively bigger events, both at the level of individual check-ins that have to get reviewed and, at a major product level, a number of releases that only happen a few times a year and are big things. But with machine learning, everything's changing every day as you're getting new data and you can't go and say, ""Well we're going to do a complete test of everything every time we get new data,"" but you have to have some process that says, ""What are we going to retest? At what level? And what are we going to monitor for? And how do we know when the world has changed out from underneath us?"" And I think we need better tooling to get that right. Awesome. Well thanks so much, Peter. It's a real pleasure to talk to you. Yeah. It was fun to talk to you, Lukas. Thanks for your time. When we first started making these videos, we didn't know if anyone would be interested or want to see them, but we made them for fun and we started off making videos that would teach people and now we get these great interviews with real industry practitioners and I love making this available to the whole world so everyone can watch these things for free. The more feedback you give us, the better stuff we can produce. So please subscribe, leave a comment, engage with us. We really appreciate it.",8617 +Robert Nishihara — The State of Distributed Computing in ML,https://www.youtube.com/watch?v=q83RkjRKS5M,2118,2020-11-13,"We have all these machine learning researchers, some of them have backgrounds in math or statistics, things like that. And they want to be spending more of their time thinking about designing better algorithms or better strategies for learning, but actually quite a few of them are spending quite a bit of time on the tooling side. Or like building better tools or scaffolding for doing fairly low level engineering for speeding things up or scaling things up. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. Robert Nishihara is the CEO of the company that makes Ray, a high-performance distributed execution framework for AI applications and others. His Ray project came out of his work at the RISELab at UC Berkeley. And prior to that, he studied mathematics at Harvard. I'm super excited to talk to him. So I'm curious about how Ray came to be and how you think about it, but maybe before we go into that, if you could just kind of give a high-level overview of what Ray does and why people use it. At a high level, the underlying trend that is giving rise to the need for Ray is just the fact that distributed computing is becoming the norm. And more and more applications, especially applications that involve machine learning in some capacity, need to run on clusters. 
They're just not happening on your laptop or a single machine. And the challenge is that actually developing and running these distributed applications or scalable applications is quite hard. When you're developing these scalable applications, you're often not only building your application logic, like the machine learning part. You're often also building a lot of infrastructure or scaffolding to run your application. And we're trying to make it as simple as developing on your laptop, essentially to let people focus just on building their application logic, and then be able to run it anywhere from your laptop to a large cluster. And take advantage of all the cluster resources, but without having to be experts in infrastructure. What's the real challenge of making that work? Because, as probably more of an ML person than a DevOps person... they'll probably kill me for even like thinking this, but conceptually it seems like a pretty simple idea. So what makes it hard to actually abstract away the underlying distributed system from the ML logic? A lot of the challenge is actually being general enough. If you have a specific use case in mind, of course you can build a specialized tool for that use case. But then the challenge is that it often doesn't generalize to the next use case you have. Like maybe you build some setup or some infrastructure for training neural networks at a large scale, but then you want to do reinforcement learning and all of a sudden you need a different system or all of a sudden you want to do online learning and it's different. The challenge is really trying to anticipate these use cases, or, without even knowing what the future use cases will be, trying to provide the infrastructure that will support them. Ray achieves this by being a little bit lower level than a lot of other systems out there. So if you're familiar with tools like Apache Spark, for example. The core abstraction that Spark provides is a dataset. And it lets you manipulate data sets. So if you're doing data processing, that's the perfect abstraction. If you look at something like TensorFlow, TensorFlow provides the abstraction of a neural network. So if you're training neural networks, it's the right abstraction. What Ray is doing is it's not providing a data set abstraction or a neural network abstraction or anything like that. It's actually just taking more primitive concepts like Python functions and Python classes and letting people translate those concepts into the distributed setting. So you can take your Python functions, execute them in the cluster setting, or take your Python classes, instantiate them as like services or microservices or actors. And in some sense, the generality comes from the fact that we are not introducing new concepts. So rather than forcing you to coerce your application into those concepts, we're taking the existing concepts of functions and classes, which we already know are quite general, and providing a way to translate those into the distributed setting. So what's something that would be painful to do in Spark, but it would be easy to do in Ray? For example, training neural networks, or building AlphaGo or building an online learning application or deploying your machine learning models in production. Those are some examples. Now let's take like building AlphaGo as an example. That does seem to me like a pretty... Maybe this is going to annoy you, but it's like a super naive question maybe in a challenging way. 
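To ground the functions-and-classes point, here is a minimal sketch of the core Ray API as described here: a Python function becomes a remote task and a Python class becomes a stateful actor. The names follow Ray's public API from around the time of this conversation (the `@ray.remote` decorator, `.remote()` calls, and `ray.get`); check the current docs before relying on the details.

```python
import ray

ray.init()  # connects to (or starts) a Ray cluster; on a laptop it runs locally

# An ordinary Python function becomes a stateless remote task.
@ray.remote
def square(x):
    return x * x

# An ordinary Python class becomes a stateful actor (a long-lived worker/service).
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# Tasks run in parallel across the cluster; .remote() returns futures immediately.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

# An actor handle can be passed around; calls execute in the actor's process.
counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```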
But AlphaGo seems almost like a very embarrassingly parallel learning problem. It seems like you could run a lot of learning at once and combine the results. Why wouldn't that work on Spark for example? There's a lot of subtleties. So if you're implementing something like AlphaGo, yes, you are running a lot of simulations in parallel. And then that's one part of it. You're also doing a lot of gradient descent and actually updating your models and these things, each of them individually are embarrassingly parallel perhaps. But one of them is happening on GPUs. One of them is happening on CPU machines. There's sort of this tight communication loop between the two, where you take the rollouts and the stuff that you do with the Monte Carlo tree search and pass those over to the training part. And then you take the new models from the training part and pass those over to do the rollouts. So there's a lot of this sort of communication. And a natural way to express this is to have these kinds of stateful actors or stateful services that have these machine learning models that are getting updated over time. And the way it's often natural to express these things is with stateful computation, which is different from what Spark is providing. So those are a couple examples. Is there something specific about reinforcement learning? Because that is actually your background, right? And it seems like that might've been some of the impetus for making this. Is there something core to reinforcement learning as opposed to like supervised learning that makes this more necessary? I think the one reason we focused on some reinforcement learning applications initially with Ray is that... Well, beyond the fact that it's an exciting application area is the fact that it's quite difficult to do with existing systems, right? So when DeepMind is building AlphaGo or when OpenAI is doing Dota, they're not doing it on top of Spark. They're not doing it on top of just TensorFlow. They're building new distributed systems to run these applications. And part of the reason for that is that reinforcement learning combines a bunch of different computational patterns together. Yes, there's the training part with the gradient descent. There's also these embarrassingly parallel simulations that are happening. There's also some kind of inference or serving where you take the models and you use them to do the rollouts. So in some cases actually you have some data processing components where you are storing these rollouts and then using them later. So it combines a lot of different computational patterns together, and it ends up being tough for specialized systems. This is an area where you benefit from having a more general-purpose system. A lot of that seems like it would overlap with supervised learning, but it sounds like there are kind of more things going on in parallel that are different. Why do you think reinforcement learning specifically requires a totally different framework? Well, I don't think it's that just the reinforcement learning that requires a different framework. I think when people build new applications and want to scale them up, they often end up having to build new systems. And so for example, with companies that we see wanting to do online learning, where there's a training component, you're learning from your interactions with the real world. 
But then also taking those models that you're training and doing inference and serving predictions back to some application in the real world, right? But to do this there's often a streaming component where you have data streaming in, a training component where you're updating your models and then a serving component where you're sending recommendations or predictions or whatever back out to the real world. And to do this, again, it's not just TensorFlow, it's not just a stream processing system, or it's not just a serving system. People end up building new systems to do this. But this is also an area where, because of Ray's generality, for some of the coolest applications we see, you can do the entire thing on top of Ray. I think one thing you mentioned to me, or maybe it was someone on your team mentioned to me in the past, is that a lot of folks are not even doing machine learning on top of Ray. Yeah. There's a mixture. Certainly a lot of people are doing machine learning, but a lot of people they're Python developers. They're developing their application on their laptop and it's too slow. They want to scale it up, but they don't want the investment of needing to build a lot of infrastructure. And they're looking for a simple way to do that. So you're absolutely right. A lot of these people, even if they're not doing machine learning today, they do plan to do machine learning. And you start to see the machine learning being integrated into more and more different kinds of applications. So it's often our users are not like just training a neural network in isolation. Sometimes they are. They're often not just deploying the model in production. Now, they often have machine learning models that are integrated in interesting ways with other application logic or business logic. And is your thought that all of that logic will run on top of Ray, or that it's just the trickiest bits? The machine learning, the most complicated parts, right? Yeah. Well, so to be clear, Ray integrates really nicely with the whole Python ecosystem. So our users, they're using TensorFlow, they're using PyTorch, they're using Pandas and spaCy, this is part of why Python is so great, right? It has all these great libraries. So our users are using these libraries, and then what they're using Ray for is to scale up their applications and to run them easily on clusters. It's not replacing all of these things. It's letting people scale them up. For sure. That makes sense. I guess slightly switching gears, I feel like a lot of people have been talking about reinforcement learning for quite a long time, and there are such evocative examples, like Go. I absolutely love those examples, but I think maybe a knock on it has been that it's maybe not used in industry as much as supervised learning. Is that consistent with your experience, or do you see reinforcement learning catching on inside of more real-world applications? It's certainly not being used to the extent that supervised learning is being used. I think a lot of companies are exploring reinforcement learning or are experimenting with it to see. I think the areas where we see reinforcement learning having a lot of success are in optimizing supply chains or these kind of operations areas or some financial applications, recommendation systems, and things like that. Of course, that's one application area that Ray supports really well, but it's far from the main focus of Ray or the only focus. Sure. 
I guess it's interesting, because online learning I would view as more best practice, and I think lots of companies are at least trying to do online learning. Do you have any way of knowing the sort of volumes of the different kinds of applications, or do you have any sense of the relative ... Just from the tickets that come in, do you have any sense of what are the ... Can you stack rank the most common uses of Ray? Is that even possible? I don't actually know the exact breakdown. There are certainly a lot of people doing stuff on the more machine learning experimentation training models. There's a number of people building their companies' products or services and running them on top of Ray, building back ends for user-facing products. A lot of people who are ... It's really just distributed Python, right? Independent of machine learning. Then there are a number of people, and, actually, this is a really important use case, a number of people building not just the end applications, but actually building libraries that other people will use, and scalable libraries. That's exciting because Ray, it's not just good for building applications. It's actually great for if you want to build a distributed system, because it is low enough level that if you were to build a system or library for machine learning training or data processing or stream processing or model serving, it can let you focus on just that application logic, right? Just on your model serving application logic or your stream processing application logic, and then Ray can take care of a lot of the distributed systems details that you would normally have to take care of, like scheduling or handling machine failures or managing the resources or transferring data efficiently between machines, right? Typically, if you want to build, say, a system for stream processing, you would have to build all of that logic yourself, not just the streaming logic, but also the scheduling and the fault tolerance and so on. By taking care of that inside of Ray, we can let library developers easily build these distributed libraries, and that can all give rise to this kind of ecosystem that a lot of other developers can benefit from. Do you think, ultimately, it subsumes what Spark does, or does it live alongside it for different use cases? I think Spark is the kind of thing where, of course, if Spark were being ... Essentially, what we would like is if Spark were to be created today, instead of back when it was created, and if Ray is living up to its promise and really delivering on what we're trying to do, then our hope is that Spark would be created on top of Ray and that for developers who want to build things like Spark, then Ray would make them successful or enable them to do that more easily. So that's a little bit of how ... Ray, it's a lower level API. One analogy is if you compare with Python, Python has a really rich ecosystem of libraries. There's Pandas and NumPy and so on. Spark is a bit more like Pandas, and Ray is a bit more like Python, if that makes sense. Yep. That makes total sense. That kind of reminds me of another question that I wanted to ask, which is is it important to you to support other languages? Do you see it as essential to ... It's funny, because we've had a couple of folks recently on this podcast who have just been surprisingly negative on Python. It's actually not my most native language, but I love it for training machine learning, but it seems like maybe there's some sense that it's slow or hard to scale. 
Where do you land on that? Well, it is hard to scale, and that's something we're trying to address. You're right, of course. It can be slow, although a lot of the way that libraries like TensorFlow and Ray and other libraries like NumPy deal with this is that the bulk of the library's written in C++ or C, and then they provide Python bindings. So Ray, like you mentioned, is actually written in a language-agnostic way. The bulk of the system is written in C++, and we provide Python and Java APIs. Of course, Python is our main focus. That's the lingua franca of machine learning today, and it's one of the fastest-growing programming languages. So it makes sense to focus there. But at the same time, a lot of companies are using Java in production very heavily, and you have companies where, a lot of times, their machine learning is in Python and their business logic is in Java. So being able to have a seamless story for how to invoke the machine learning from the business logic, it's actually a pretty nice feature of Ray. Down the road, we do plan to add more languages. Setting aside your pragmatic CEO hat of wanting to support the languages that people actually want, do you think that Python will stay the lingua franca of machine learning for 20 years? Do you have any set feeling on that? I don't think I have any particularly special insight here. I could see that going either way. But I guess you've dug deep, though, and I feel like sometimes the people building the tools get more frustrated with Python than the people using the tools. I'm not sure. Certainly there are a lot of newer features in Python that are making people's lives easier. There's more happening in terms of typing and things like that. You can really do anything with Python. It's extremely flexible. When we design APIs, for example, pretty much any API that we can imagine wanting for Ray, we can just implement that in Python. Of course, when we say, ""Okay, what should the API be in Java?,"" a lot of times, you run into limitations with the language. You can't just have any API that you want. But maybe that flexibility trades off with fundamental constraints on speed, or do you not feel that way? It trades off with something. I don't know if it's the performance or something else. Interesting. Okay. To switch gears again, another thing that I wanted to ask you about is when you started grad school, were you imagining that you'd become a person that runs an ML tools company? Were you trying to become an ML researcher? What were you thinking at that moment? Then what were you thinking when you started this project? Did you imagine that it could become a company, an important open source project, or was it to meet a need that you had at that moment? Yeah, that's a great question. So when I started grad school, I was very focused on machine learning research, and I was actually coming from more of the theoretical side, trying to design better algorithms for optimization or learning or things like that. This was definitely a change in direction, although it was gradual. 
You have all these machine learning researchers who some of them have backgrounds in math or statistics, things like that, and they want to be spending more of their time thinking about designing better algorithms or better strategies for learning, but actually, quite a few of them are spending quite a bit of time on the tooling side or building better tools or scaffolding for doing fairly low-level engineering for speeding things up or scaling things up. We were in this situation where we were trying to run our machine learning experiments, but built these tools over and over. These were always one-off tools that we built just for ourselves and maybe even just for one project. At the same time, we were in this lab in Berkeley. ... at the same time, we were in this lab in Berkeley, which was surrounded by people who had created Spark and all these other highly successful systems and tools and we felt there had to be something useful here that we could build, or we knew the tools that we wanted. And so we started building those and the goal from the start was to build useful open source tools that could make people's lives easier. And we had the idea for Ray initially, we thought we would be done in about a month. And of course you can get a prototype up and running pretty quickly, but to really make something useful, to take it all the way, there's quite a lot of extra work that has to happen. So that's how we got into it. When did you feel like, ""Okay, this could be a company""? The scope of what we wanted to do was pretty large from the start. We didn't envision this as just a tool for machine learning or just a tool for reinforcement learning or anything like that. It was really, we thought this could be a great way to do distributed computing and to build scalable applications and combined with the fact that from where we were sitting, it seems like all the applications or many of the applications are going to be distributed. So, what we wanted to build was quite large from the start and to really achieve that, it's an effort from a lot of different people and a company is a natural vehicle to go about these kinds of large projects. We were seeing a lot of adoption, a lot of people using it and a lot of excitements and we thought it made sense as a business and combined with the fact that it was a problem that we thought was important and timely, those are the factors that led to us wanting to start a company. And how's the transition been from grad student to start up CEO? It's really exciting. As you can imagine, there's a lot of differences and there's a lot to learn, that's for sure. But I'm working with really fantastic people and even in grad school, like before we started the company, we were working with a great group of highly motivated people. And we had already started thinking about some of the same kinds of problems of how do we combine our efforts to do something that adds up to something larger and how do we grow the community around Ray. It was a pretty smooth or gradual transition. Have there been users or customers that have pulled your product or requirements in surprising directions? Yes, absolutely. I can start with one example on the API side. So actually, some of the initial applications that we wanted to support, like training machine learning models with the parameter server or even implementing some reinforcement learning algorithms, those actually weren't possible with the initial Ray API. 
I mentioned that Ray lets you take Python functions and classes and translate those into the distributed setting. When we started, it was actually just functions, not classes. So we didn't have the stateful aspect. And that was pretty limiting. Just functions are pretty powerful, you can do a lot with functions, but one day we just realized we were doing all sorts of contortions to try to support these more stateful applications. And so at some point we realized, ""Oh, we really need actors. We really need this concept of an actor framework."" And once we implemented actors, I remember Philipp and I, once we realized this, we mapped it out and divided up the work and tried to implement it really quickly and that just opened up a ton of use cases that we didn't imagine before. But there were still multiple steps to that. So when we first had actors, only the process that created the actor could invoke methods on the actor, could talk to the actor. And at some point we realized we needed to have these handles to the actor that can be passed around to other processes and let anyone invoke methods on any actors. And that was another thing, when we introduced these actor handles, that just opened up a flood of new use cases. So there've been a couple of key additions like this, which really increased the generality and the scope of the kinds of applications we can support. But there haven't been too many changes. It's actually been a fairly minimal and a stable API for quite a while. So there's that. And I would say there are other important Ray users that have really pushed a lot and done a lot in terms of things like really pushing for more performance, improving performance, how can we keep making it better? Also, on the availability side, they're running this in production during really mission critical times and how can we make sure that it's not going to fail ever. And also, really, the support for Java, that's something that came from the community, both initially adding Java bindings as well as then doing a lot of refactoring to move a lot of the shared Python and Java logic into C++. Those are some examples. There have been pretty tremendous contributions from the community. This isn't just a request, it's actually committing code- Absolutely. Yeah. How do you think about managing like a large open source community? Like how do you do basic things, like make a roadmap when people are coming and going and have different opinions on what to do? It's a good question. And I wouldn't say that we have totally nailed it just yet, but we use a lot of Google Docs, a lot of design docs in Google Docs. We use Slack very heavily. So we have a Slack that anybody can join and that's a good way to poll people, for users to ask questions, anything from the roadmap to just some error message that they're seeing or asking if there's anyone else using Ray on Slurm or something like that. And then a number of other things like just before the pandemic, we were doing a lot of meetups. We have this Ray Summit coming up this coming September, these kinds of events to really meet users in person or virtually and to just get a sense of what people are working on and that kind of thing. That's cool. Have you ever had this situation where like someone submits a pull request and they obviously put in a ton of work, and you're just like, ""Ooh, I just don't want to go there""? Yeah, that's certainly happened.
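To make the functions-versus-actors distinction above concrete, here is a minimal sketch of the pattern Robert describes, using Ray's public `@ray.remote` decorator. The `square` and `Counter` names and the toy values are invented for illustration; treat it as a sketch of the idea rather than anything from the Ray codebase.

```python
import ray

ray.init()  # start Ray locally; on a cluster this would connect to it

# Stateless case: a plain Python function becomes a remote task.
@ray.remote
def square(x):
    return x * x

# Stateful case: a Python class becomes an actor whose methods run in a
# dedicated worker process that keeps state between calls.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
# The actor handle can be passed to other tasks and processes, which is
# the "actor handles" feature discussed above.
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]
```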
And we try to get in front of that by having a design doc ahead of time. You don't want people to spend a huge amount of time on something like that if people are not on the same page about whether that's even desirable or not. I think a lot of the time those conversations are really happening over Google Docs and over the design docs. And that kind of pushback is moved earlier in the conversation. Yes. But I feel like there's this knock on ML researchers from some people, and definitely not any Weights and Biases employees, but I think some people that I've met feel like ML researchers' code is low quality, maybe because, for instance, once they get the paper published, they wash their hands of it, and so they don't actually stay to see the maintenance life cycle and they don't learn to architect things well. But I think it's interesting, as you started as an ML researcher, and actually more of a theoretical ML researcher, which I hear some people think are the worst culprits in this domain. And you went to this very architecture-heavy, kind of tricky programming project. Has it been a transition for you to up-level your skills around this, or have you learned stuff along the way, or did it come naturally to you? Oh, I've definitely learned a lot along the way. And I think a lot of this was Philipp, who I work with, one of my co-founders. He's been building systems for quite a long time and has a lot of expertise in this area. So I think maybe there was less of a transition for him. And then combined with the fact that we were in the AMPLab and RISELab at UC Berkeley where people had created Spark, had created Mesos, a lot of just leading people in distributed systems. And Berkeley also has a long tradition of creating great open-source software. So if we were doing this in isolation, it would probably look very different, but we were in this great environment with all these experts we could really learn from. So I think that played a big role. What a great lab. It's amazing how many amazing projects have come out of it. Yeah. And of course, you're probably familiar with Caffe, with deep learning frameworks like that also coming out of Berkeley at the same time, or actually Caffe was a little earlier. One advantage of machine learning researchers building tools is that they know exactly what problem they're trying to solve. There are some advantages there as well. Totally. I should say, you maybe can't say this, but we can definitely say it, a lot of our customers are huge fans of Ray. That's great to hear. One thing that a lot of our customers use and really like is Ray Tune. And I'm curious specifically how that came about and what your goals are for that. Our goals there are to build really great tools, and ideally the best tools, for hyperparameter tuning. And hyperparameter tuning is one of these things which is pretty ubiquitous in machine learning. If you're doing machine learning and training a machine learning model, chances are you're not just doing it once, but actually a bunch of times and trying to find the best one. And this is something, again, where a lot of times people are building their own tools. And you can write your own basic hyperparameter search library pretty quickly. It's a for loop if you're doing something simple.
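As a rough illustration of the "it's a for loop if you're doing something simple" point above, here is a sketch of a naive search next to what roughly the same search looks like as a Ray Tune trainable. The training function and search space are made up, and the Tune entry points shown (`tune.run`, `tune.report`) match Ray releases from around the time of this conversation; newer Ray versions expose a different `Tuner` API.

```python
from ray import tune

# The naive version: hyperparameter search really is just a for loop.
def train_model(lr, hidden_size):
    # Stand-in for a real training run that returns a validation score.
    return 1.0 - abs(lr - 0.01) - 0.001 * hidden_size

best = max(
    ((lr, h, train_model(lr, h)) for lr in [0.1, 0.01, 0.001] for h in [64, 128]),
    key=lambda triple: triple[2],
)
print("best (lr, hidden_size, score):", best)

# Roughly the same search expressed as a Ray Tune trainable, which is where
# schedulers like ASHA/HyperBand or population-based training can then stop
# unpromising trials early or move resources to the promising ones.
def trainable(config):
    score = train_model(config["lr"], config["hidden_size"])
    tune.report(score=score)

analysis = tune.run(
    trainable,
    config={
        "lr": tune.grid_search([0.1, 0.01, 0.001]),
        "hidden_size": tune.grid_search([64, 128]),
    },
)
print(analysis.get_best_config(metric="score", mode="max"))
```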
But these experiments can be quite expensive. And if you're trying to make it more efficient or you're trying to speed up the process, there's quite a bit you can do in terms of stopping some experiments early or investing more resources in the more promising experiments or sharing information between the different experiments, like with population-based training or hyperband or things like that. So there's quite a lot of stuff you can do to really make the experiments more efficient. And we're trying to just provide that off-the-shelf for people who want to do that at a large scale in a way that's compatible with any deep learning framework that they're using and just works out of the box. And so is part of the vision there to show people some of the libraries that you think should be built on Ray so that they get inspired to build more libraries? Yes. How do you think about What libraries you should build as a core part of your project and what ones should be third-party? So in the long run, most of the libraries will be built by third parties. But I think it's important to start off with a few high quality libraries that address some of the big pain points that people have right away and are the kinds of things that people would want to use Ray for or have to build themselves if we didn't provide a library. We essentially started with a scalable machine learning trying to provide libraries that let people address some of their bigger pain points. And then for everything else that we're not providing, they can just build it themselves using Ray. Or hopefully in the longer run, other people will build libraries that really flesh out this ecosystem. When you look at machine learning projects that you've been part of or that you've seen and you look at the whole arc from conception and experimentation to deployed and useful and production, where do you see the most surprising bottlenecks? So one obvious aspect is the bottleneck around scaling things up. This is one of the core things we're trying to address with Ray. One less obvious bottleneck is about interfacing machine learning models and your machine learning logic with the rest of your application logic. And one example where this comes up is with deploying or serving machine learning models in production. So web serving has been around for a long time. And you have Python libraries, like Flask, which lets you easily serve webpages and things like that. So what's the difference between regular web serving and serving machine learning models? Superficially, they might seem pretty similar. There's some end point that you can query. And in fact, when people are deploying machine learning models in production, they're often starting with something like Flask wrapping, their machine learning model in a Flask server. But you run into a lot of pain points there as you start to deploy more models or you start to want to batch things together or you start to want to incrementally roll out the models or roll back or things like that or compose models together. At the other end of the spectrum. You have specialized systems or tools for machine learning model serving, so things like TensorFlow Serving. I think there's a PyTorch one as well. And the challenge with a lot of these frameworks for machine learning model serving is that they're a little too restrictive. And so often, it's just a neural network behind some end points. It's a tensor-to-tensor API, so a tensor going in and then a tensor going out. 
Often what you want is to have the machine learning model as part of the serving logic but actually to have other generic or application logic surrounding that machine learning model, so whether that's doing some processing on the input or some post-processing on the output and really combining these things together. So that's one pain point I've seen quite a bit. And we're actually building a library called Ray Serve on top of Ray to really get the best of both of these worlds. Cool. That's awesome. Okay. My final question is, when you look at machine learning broadly, research but also production, implement, all these things, what's a topic that comes to mind as something that people don't pay enough attention to, that's more important than the credit that it gets? So I'm not sure if this is underrated, but one area that I think has a ton of potential is in using natural language processing to help people ask questions about data and help people ask questions about all the information and data out there. And for example, the fact that if you Google something like a simple fact, what year was George Washington born or what's the capital of California, you immediately get an answer. And so it makes it easier, natural for people to ask interesting questions about facts and to realize that there's some ground truth out there. Now, if we can provide similar tools that let people ask lots of questions about data sets that are not simple facts that you can look up in a database, but rather have to be inferred by performing some regression or some filtering or some basic statistics, what is there correlation between X years of school and income or things like that, the hope is that it would become very natural for people to start to ask questions about data and to get in the habit of trying to glean information from data sets out there. And I think that's something that's becoming more possible. And it will be very exciting. What a great answer. I love it. Thank you. Awesome. So I think one of the things that's coming up shortly that I'm really excited about is your Ray Summit. Can you tell me a little bit about what you're hoping to accomplish there and who should come to it? Absolutely. And also, really excited for your talk there as well. So if you're interested in learning more about how companies ranging from tech companies, like Microsoft or AWS, to companies in finance, like JP Morgan and Morgan Stanley or Ant Financial, to startups and researchers are using Ray in production to scale up or speed up their machine learning or their Python applications, this is going to be the best place to do that. And we're really excited. We're going to be hearing from some of the leading figures in the machine learning ecosystem as well as people like Michael Jordan, people from DeepMind, Google Brain, as well as prominent people in the Python community, like Wes McKinney who created pandas, as well as tons and tons of companies using Ray to really do machine learning or scale up their applications. It's an area where it's an opportunity for the Ray community to see more about what everyone else is doing, to get to know each other better and to really showcase some of those use cases. Nice. I'm really looking forward to it. I'm definitely going to be there. And yeah, I think I'm giving a talk. Yes, you are and we're super excited about that. Awesome. Thanks so much. We appreciate it. Thank you. 
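For reference, the "wrap the model in a Flask server" starting point Robert describes above looks roughly like the sketch below, with application logic around the model for pre- and post-processing. The model class, feature names, and route are placeholders invented for the example, and the pain points he lists (batching, incremental rollouts, composing models) are exactly what this simple version does not handle.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

class DummyModel:
    """Stand-in for a real trained model object."""
    def predict(self, feature_rows):
        return [sum(row) / len(row) for row in feature_rows]

model = DummyModel()

def preprocess(payload):
    # Application-specific logic that turns raw JSON into model features.
    return [payload["feature_a"], payload["feature_b"]]

def postprocess(prediction):
    # Application-specific logic wrapped around the raw model output.
    return {"label": "positive" if prediction > 0.5 else "negative"}

@app.route("/predict", methods=["POST"])
def predict():
    features = preprocess(request.get_json())
    prediction = model.predict([features])[0]
    return jsonify(postprocess(prediction))

if __name__ == "__main__":
    app.run(port=8000)
```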
When we first started making these videos, we didn't know if anyone would be interested or want to see them, but we made them for fun. We started off making videos that would teach people, and now we get these great interviews with real industry practitioners. And I love making this available to the whole world so everyone can watch these things for free. The more feedback you give us, the better stuff we can produce. So please subscribe, leave a comment, engage with us. We really appreciate it.",6445 +Ines & Sofie — Building Industrial-Strength NLP Pipelines,https://www.youtube.com/watch?v=v550Ve66vEc,3520,2020-10-29,"We did have people in the past who were like, ""Well, I want my system to be 90% accurate."" And then we're like, ""On what?"" And they're like, ""No, no, 90% accurate."" And it's like, ""What you're doing with your system will decide about how successful your system is, and that's what you want to measure. And that's what you want to focus on."" And I can see how this sometimes gets lost if you're not thinking about it, and all you follow is like research, which has kind of a slightly different objective, because you're comparing algorithms. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Today, I'm talking to Ines and Sofie. Ines is the co-founder of Explosion AI, which is a digital studio specializing in tools for AI technology. She's a core developer of spaCy, which is one of the leading open source libraries for NLP and Python, and Prodigy, which is a data annotation tool powered by active learning. Sofie is an NLP and Machine Learning Engineer at Explosion AI, and she has more than 12 years of experience in NLP and machine learning, including in the pharmaceutical industry and the food industry. I'm super excited to talk to them today. Hi, I'm Ines. Some of you might know me from spaCy, which is an open source library for natural language processing in Python. And our company is called Explosion, and we specialize in developer tools for AI and machine learning, NLP specifically. And so we build tons of open source tools that are quite popular, which is really cool. And we also have a commercial product, which is called Prodigy, which is an annotation tool for creating training data for machine learning models. It's a developer tool, it's fully scriptable in Python, so you can use it to script your own custom data flows, your own custom labeling recipes, put a model in a loop. There's all kinds of cool stuff you can do with it. And that's what we do. And we're currently working on version three of spaCy, which we're very excited about, and we've really kind of finally taken spaCy to the next level, letting people use all this really new, modern, cutting edge NLP stuff. And we're also working on Prodigy Teams, which is more of a software-as-a-service extension to Prodigy, and really lets you scale up your annotation projects, and manage larger annotation teams, but without compromising on the data privacy and scriptability. Okay. So before we jump into the details of your library, just as a fellow entrepreneur, I have to ask, tell me the story of how you started this and what it was like? Yeah. So, spaCy started when, well, my co-founder Matt left academia, so he was working on NLP, he was researching, publishing stuff, but he always tells the story as like, he got to the point where he had to write grant proposals, which he didn't want to do.
And the same time, he was realizing that people were using his research code in live production niche environments, which was kind of ridiculous because they're still research codes. So it's like, well, there's clearly something there. So he left academia and started writing spaCy as an open source library. And then shortly after that, we met here in Berlin and we started working together. And initially, I was working on more visualizers for the library, which are still an important part of what people like about spaCy. But I was a bit skeptical at first. My background is not NLP specifically. I've always programmed, I did linguistics, but my first reaction was like, ""Sounds kind of boring. I'm not sure if I want to..."" Really? Yeah, quite literally. Yeah, so he was like, ""I'd love to have this syntax visualizer."" And I'm like, ""I know totally what it is because I know linguistics,"" but I was like, ""That sounds... I don't know if I want to do that. I'd rather work with something else."" But I did, and it was good. And it's still very popular, right Ines? Yeah, I think when people think about spaCy, they think about our syntactic visualizer a lot. So I'm glad I did it. Yeah. For what it's worth, that's my impression. It's quite beautiful. So yeah, that's how it all started. And then at some point we were like, ""Well, okay. We want to found a company around this,"" but we also knew we didn't want to go down this typical startup route because we saw, ""Hey, look, there's tons of money to be made. We can just run our company. We don't need to run at a loss. We don't need to have this crazy scale in the beginning. We can just build stuff and make money doing it."" So that's how we set up the company and we're still fully independent, which is cool, and it gives us a lot of opportunities. And we're now a small team of, I think, eight people at this point, including Sophie, who was, I think, one of our first full-time NLP people who joined the team. Nice. And so Sophie, how long have you been working with the team? So I guess I started working for spaCy at the beginning of 2019. So almost two years now. Wow. Yeah, I think I met Ines in a pub in Brussels. Yeah. After an NLP meetup. Yeah, it's not like we met randomly. So there was definitely a theme to the day. And I think I really loved their vision, not just for how to run a company, but also on how you should iterate over data and your models and just this very pragmatic view on how to apply machine learning in an industry context, really. Because I guess you've also seen this done okay, but also you've seen this done quite badly in maybe some contexts. Yeah, exactly. So my background is I do have a PhD in the Biomedical LNP, so I'm on biomedical texts, but then I worked for Big Pharma for three years, so J&J, and indeed there's many examples on how to apply machine learning models and how not to apply machine learning models. So yeah. And I've always been, I think, working on open source is just the best thing there is, so they didn't need much convincing, I think, to start working. Nice. Well, we'll get into the technical details in a second, but I have to ask, so how do you make money? How does your business operate [inaudible 00:00:06:18]? Yeah, we sell a product. So we do this really crazy thing where we sell something and people give us money and then we spend less money than we make, and then we make a profit. No, but it's just like, for Prodigy, you can buy Prodigy. 
And we sell Prodigy for a lifetime license, so it's a one-time fee. And the great thing about being in the software business is you can sell a piece of software and then you can sell it again. You have an open source model, right? So what parts are open source and what parts are for sale? So we've never really liked this idea of having this sort of freemium thing or this kind of open core, because the problem there, it introduces a lot of questions, and it puts you, as a company, in a very weird position because we want people to have an easy time using our tools. We want our docs to be great. We don't want to sell help with our own software. So if we did the consulting model where we sold stuff on top of spaCy, we'd constantly be in this position where we're like, ""Well, if our docs are too good and our library is too good, nobody gives us money. But if it's too shit, then nobody wants our services because they're not going to use our tools."" So we never wanted that. And then we also see, because there's always a difficult story around an open source library, if you suddenly have these components that are free. And in general, algorithms is not really what I think is the thing that you should be selling. There's developing experience. That's something that makes companies a lot of money. That's something people will be paying money for, just like...That's big thing. Why people use and pay for Weights and Biases, for example, it's developer productivity. Same with data, anything around creating data. That's where the real customization comes in. That's the valuable stuff people are working on. And that's also where products can be. So we have a separate product that, probably you'd be interested in if you're a power user of our open-source tools, but it's separate. spaCy is free, the code is open source, and we sell additional products to it. And is the additional product Prodigy? Yeah. That's currently the one product we have. But we also, as I mentioned earlier, we're working on Prodigy Teams, and that we'll have a more SAAS type of model that people have been waiting for. I see. But Prodigy's a software that I could buy one time. Yes. Yeah. So you can go online right now, go onto online shop, buy it, download it, PIP install it, use it. That was also the idea. We really want to make the path for a developer as easy as possible and make it easier to start using our tools. And then there's a lot more you can be doing. And I guess the typical person that you sell to, and the person that uses a spaCy, is someone working in natural language processing, trying to build models. Do I have that right? Probably developers at all kinds of companies from top Fortune 500 to startups, to academics, researchers. Also a lot of people are getting into NLP now who don't have the classic machine learning background, as in like, ""Oh, by a computer science PhD, and then start up or something."" It's a lot of people, like in digital humanities, from the medical field. There are lots of people entering the field now who want to solve problems. And that's what we also find very interesting, because they're bringing the domain knowledge, they have a problem they want to solve, they know how to solve it. And then, okay, they're learning machine learning, which I would say is often a better path to success than coming from the other direction, knowing machine learning and then thinking about, what problem in the world can I apply it to? 
I think a lot of terrible products have been born out of, especially, a very naive or arrogant view on this sort of thing. So spaCy, your library, let's talk about that. I want to say before we talk about kind of the new things you're doing, maybe for someone who hasn't heard about your library or looked at it in detail, what are the big components and what was it designed to solve? Well, it was initially developed to really process large volumes of texts at scale, efficiently. So, you want to process the whole internet, well, the internet is not going to get smaller. Computers are not going to get faster. So you want to have efficient software to do it. And it started out by just having... And also, of course, it was always designed to be used in production and in industry use cases, which, especially at the time when we started, wasn't such a consideration. Most code is written for research, so spaCy really took the other approach and said, ""Okay, look, you want to process texts and do stuff with text. We're giving you one implementation, which we think is the best. We're not giving you like 50 parsers you can all benchmark and play with. We're giving you one that works best."" So that's how it started. So we have different components for different things you can analyze about language, starting with, what's even a word? It sounds very basic, but it's obviously a lot more complex than it sounds, to what concepts are mentioned in the texts. What's the text about? What's a verb? What's a noun? How is stuff connected? And then various other things you can build on top of that. I think one of the main features in version two of spaCy that is currently available is that there's a lot of pre-trained models for different languages. So people can come and say, ""I want this general purpose parser that just parses my French texts and tells me what the labels are, what the part of speech tags are, what the entities are, like which are persons in the texts and these kinds of things."" But I guess what we've seen over the years is that people are now wanting to more and more train their own models. So we don't want just one French model, right? There can be very different texts, like biomedical texts will need quite a different parser than just your general domain English news or French news. So I think this is also a little bit the shift to version three, where we're making everything much more configurable, so that people can also go and train their own models, and you won't just be limited to the pre-trained models that we'll have online, even though we'll still have them for people to quickly start with, but they'll also get much more power over training their own models, I think. So I think this also reflects a shift in how people are dealing with NLP over the last years. I feel like you've had front row seats to sort of the big shifts in language processing. And I feel like my impression from the outside is that, some people might even skip the parsing step for a lot of applications. Are you seeing that? How true is that? It's difficult to say what all the different users are doing exactly, but this definitely has also inspired the move to version three of spaCy, so the transformers that are being published by HuggingFace, this huge repository of models that are extremely useful for NLP, will basically become available within spaCy three just through our spaCy transformers library.
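As a concrete picture of the "general purpose" pre-trained pipelines described above, a minimal usage sketch might look like the following. The model name and example sentence are just illustrative, and you would need to download the package first (for example with `python -m spacy download fr_core_news_sm`).

```python
import spacy

# Load a pre-trained French pipeline (tagger, parser, NER, ...).
nlp = spacy.load("fr_core_news_sm")

doc = nlp("Apple envisage d'acheter une start-up française pour 1 milliard d'euros.")

# Named entities: which spans are persons, organisations, amounts, ...
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tags and dependency labels for each token.
for token in doc:
    print(token.text, token.pos_, token.dep_)
```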
So you will just be able to sort of plug and play and just put a transformer in your pipeline and use that. And if at that point you feel like you don't need a specific parser model anymore, sure, you could definitely go and try that. So you'll be able to just write your own model on top of a transformer output and see how that does, basically. Yeah, but also I think another thing, on that topic, that we've seen is while, of course, there are a lot of interesting things, especially in research, of end-to-end things you can predict that are now really exciting and actually work. You can model a really complex task and really predict it end to end, and you don't need to go via all these different components and stitch it together, and that's very exciting. That's one thing people might be doing, but we've also seen that in more real life applications, it's still often very, very useful to have different components that you put together that you can train on very specific data, and also that you can adjust, retrain, and customize. And that's just because a lot of the things people are actually trying to solve in industry and in real life, so to say, are things that you just can't easily predict end to end and throw a huge language model at it, and then that's it. Of course not, because otherwise, those problems wouldn't be valuable, and companies wouldn't be spending a lot of money on it. The most interesting and most valuable things you want to build are the things that are very specific. And so for these, you often want different building blocks you can combine and train, and that's also something we want to make easier going forward. Could you give me some examples of things like that, that companies care about, but maybe aren't as well studied in academic circles? I wouldn't even necessarily say that it's something that's not studied in academic circles. It's more like... Okay, often one thing that people always want to do is information extraction. Companies have amassed texts and texts and texts, and now you want to find something out about that text. And I think one example I sometimes use in my talks is imagine you have all these financial news that you scraped and you want to extract company sales, and you want to extract who bought which company for how much money and in which country and for what currency or something like that. The stuff in this sort of space is endless, and many companies want to do that. And sure, it's something you can try to... If you have a huge language model, you can try and just throw that prompt at it and come up with some ways to fine-tune the model. And maybe at the end of it, it will output your structured JSON representation that has all your data in it. But that's often not the most efficient way to go about this problem. Like maybe you start off and you say, ""Cool, I want to detect company names."" Maybe for that, you could train an entity recognizer to predict GitHub as a company, Microsoft as a company, whatever. The same works quite well for money. Actually, probably in real life, you also have tons of noise. You want to try to classify it first to filter out whether something is actually even a text you care about. Then, okay, you have these random money values. You want to normalize them. Do you need machine learning for that? Maybe not, probably not. There's quite a simple algorithm that can do that for you, that you want to just combine.
Then you want to look up a stock ticker. That's not something you want to predict. You don't want to have like a sequence to sequence task or like, or whatever. I don't know. You don't want to have a language modeling thing. And for that you look it up in a database or somewhere on the internet and then you want to put that all together and put it into like a structured format that then you can feed into your database. And that's, I think is a good example of... Yeah, that's super common. Predicting this end to end is really interesting research topic, but not a practical approach. And it requires all these different components and maybe some component in between is like, yeah, hack together regular expression that does part of the job, great, and works 99% accurate. What more can you want? And so that's how we see a lot of the NLP done in practice. Well, that's consistent with the NLP that I've seen in my life. So that makes total sense to me, I guess, what are the parts of that that kind of spaCy helps with that? Or maybe, what are the decisions that spaCy has made that might be different than a more research focused NLP library? I think I would say the opinionated take on stuff. We're like, well, there's one... We're moving a bit away from that in version three just by keeping people- Yeah. I was going to ask, because like with Huggingface, there's lots of choices. So I guess, how do you handle the tension between those two goals? Yeah. I mean, in general, we do want to give people the most reasonable default that works best because we think that's good. And even for example, with the transform integrations, we've been running experiments. We've looked at what models actually work particularly well. So we can also provide some guidance there and say, ""Oh, if you care about efficiency and you want to use that language, probably use those pre-trained weights."" But yeah, more generally we've always started out being quite opinionated and also focused on efficiency. That's something... It's not a researcher's fault that like, ""Oh, your thing is slow."" It's like, well, yeah, that's not its job. Its job is to produce these benchmarks so we can compare algorithms. That makes sense. That's sort of problem that research has. It's just that, okay. If you want to actually use a lot of these things, you need to make choices and modeling choices that actually get your job done efficiently. So that's... Oh, I see. So even like picking models that will run efficiently. Architectures, how we set up the pipeline, for example. A lot of models will have embeddings for each of the tasks they're doing, and you have the embeddings copied for maybe component that can work, but makes the model quite large. You're always recomputing a lot of stuff. So we're thinking about, well, how could we make it easy to share some of these things for multiple components? So when you're only computing the representations, what stuff like that? Those are all decisions that we have to think about that maybe for a researcher, it doesn't matter so much. I guess one thing that I've always seen in industry caring more about maybe than... I mean, academics talk about multi-lingual support a lot, but I feel like in the end, many, many papers are written on English Corpus. There's good reasons why, I guess, but it does seem like multi-lingual support is front and center to most big companies. Right. Because you have, texts in multiple languages. Is that something that you've thought a lot about? Yeah. Sorry. 
I thought Sofie was going to answer for a second. Okay. No, no, absolutely. And also there's often more to supporting a language than just training on some random corpus that's available for the language. For example, our tokenization algorithm basically produces actual words, so it's linguistically motivated tokenization. And that also introduces a lot of considerations for like, okay, how do we deal with that language? What characters does that language have? How does that language normally work? Then, okay, what data can we train on that's actually useful? Because that's also not necessarily the same across research and industry. But yeah, of course there are lots of... And it makes sense why everything is in English in research. You can't fault an individual researcher for evaluating and running their tasks on an English corpus because that's just where the competition happens. But yeah, in a more real life scenario, sure. And I guess bioinformatics is kind of like an in-between where maybe it's in English, but it's just such a different domain. Do you suggest people use different models if they're working in that domain? How do people think about that? Yeah, no. You'll definitely need different models. I mean, there's just such a difference in grammar and the kind of entities and words that are being used in biomedical texts. And I think there's plenty of domains like that. Like finance and biomedical. These are all very different domains and you really want to train your model on that specific domain. And so what we've seen is, not just for the languages, we've seen a lot of community support. Like for instance, if I remember correctly, Ines, the Japanese model had a lot of- Oh yeah. And Chinese as well. Yeah, yeah. The Japanese support, and Chinese as well, has had a lot of support from the community, because obviously it's difficult with, like, not even 10 people to be able to support all of these different languages, if you want them to be linguistically sound. But also for the biomedical domain, actually there's even a plugin which we call... We have this list of plugins that we call the spaCy Universe. So people just write different packages that plug into spaCy, and they have trained specific models for the biomedical domain. So it's just perfect for them to go and use that, or at least start from that, if you're processing that kind of text. So I think spaCy is quite nice in that sense, that there's quite a big community around it. So the one that you mentioned is called scispaCy. scispaCy. That was developed by Mark at Allen AI. And there's also a project called Blackstone, which does the same for the legal domain. And I think both of these actually are great examples because when you look at the components that are implemented, you can see a lot of thought went into what these components should do and what's appropriate for that specific domain. Like, oh, what do we need to do differently if we want to segment sentences properly in legal texts? If you want to do this well, you need to understand legal texts and how these things are written. What are the problems? Okay, maybe we can have... Of course, there's this component for resolving acronyms, I think, which uses a specific algorithm. And it's pretty basic, but it can be implemented with spaCy and it just works.
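To make the idea of a small, domain-specific component concrete, here is a hedged sketch of adding a simple rule-based step to a spaCy v3 pipeline using the `@Language.component` registration. The normalisation logic, attribute name, and example text are invented for illustration and are nothing like scispaCy's or Blackstone's actual implementations; the example also assumes the `en_core_web_sm` package has been downloaded.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

@Language.component("normalize_money")
def normalize_money(doc):
    # Toy example: attach a normalised float to MONEY entities via a
    # custom extension attribute.
    for ent in doc.ents:
        if ent.label_ == "MONEY":
            digits = "".join(ch for ch in ent.text if ch.isdigit() or ch == ".")
            ent._.normalized = float(digits) if digits else None
    return doc

# Register the custom attribute once.
Span.set_extension("normalized", default=None)

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("normalize_money", after="ner")

doc = nlp("Microsoft bought GitHub for $7.5 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_, ent._.normalized)
```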
That just extends what's already there, but I think it's very interesting to see these projects where you can really tell, oh, a lot of knowledge and insight about the field went into developing that specific model. And yeah, I guess that's why we're back at the hammer and nail type stuff. I want to go a little deeper just because you were talking about this before, and we were chatting about the new spaCy library and you were showing me all the stuff that it can do. It seems like you put a lot of thought into a lot of different components. Like one that I just recognized as someone who's also wrestled with this problem is your configuration system looks super cool with the nested configs and the way that you can put actual logic in there. Can you talk a little bit about... I feel like people might not realize how complicated setting up configurations would be if they haven't wrestled with the problem before. Yeah, I agree with you. I think personally, this is one of the biggest strengths of the new release coming up. Version three of spaCy uses these configuration files. Because before that we would just have all of these defaults across the library and then it would be difficult to really get at them and change them. And now, with the config system, we basically just define all of the different components in an NLP pipeline so that you know exactly what is in there and what isn't, and then you can basically tune all the different parameters of each component. And I think Ines has worked the most on actually the backend of this config system. Right. And getting it to work and all of the filling in defaults and validation stuff. And I think we battled with it for a bit, but I think right now it feels very robust. And like the other day, I was writing a config and made some kind of, I don't know, you write false with a capital or something like that, and the system just automatically tells you this can't be right. I mean, did you mean a string? Is this a boolean? What do you want? And so it automatically fails. And I actually think it's fun to work with because you get stopped very early on in your experiment. You get this feedback of, is this a valid config or not? Can you continue with it? I think we just have to accept that bugs will happen and that things can go wrong. And especially, machine learning is just hard and it's complex. You're basically passing these super abstract arrays and things with hundreds of dimensions from function to function, and then you're computing shit with it. And then you're passing that all the way back and hopefully something comes out at the end. That is just complex. So I feel like probably everyone who listens to this can relate to debugging ""couldn't broadcast shape blah to shape blah."" I mean, that just comes up and you're like, ""Fuck. Somewhere I have a bug."" And that happens like all the time, where it's like, oh, you have the hidden width set to whatever here. And why does my thing not learn? And why does it all fall apart? It's like, yeah, because over there you need to set that same width, and you don't because it uses the default value of that keyword argument that you've set to minus one. I don't know. That just kept coming up. We were like, ""Well, that sucks."" And mistakes will happen and problems will happen. And stuff's not going to get less complex. Also, you're not solving a problem by just abstracting away the complexity. That's another thing.
If you see these config files and it has like every parameter defined, everything in there, you might be like, ""Oh, well, that's like super complicated. How is that easy?"" And it's like, well, easy doesn't always mean no code out of the box. It's like, you need to solve the problem. And that's something you can do by providing better development experience, not just by abstracting it away. Totally. And I feel like... Oh, go ahead. Yeah. I just wanted to say, so I'm not sure when we started implementing these configs, I think around January, maybe. So I think we've been working with them now for more than half a year ourselves. So I think that's also why we just saw what all the problems were and like these hidden... Like the parameter in that part of the config needs to be the same one as a nested parameter in that other part. So we have all these referencing mechanisms and so on. So I think we've sort of battle-tested it by now, so that hopefully these common errors don't pop up as much anymore. Yeah. But it feels kind of nice. Yeah. No, it's always satisfying. You built something and then you actually use it and you're like, oh, that actually works. I mean, not that it wouldn't have. Yeah, when we started playing with Weights & Biases, that actually also came up again because we were like, yeah, let's just log this whole config, and then we're like, oh wow. Now we can actually see how all of these values relate to each other. And it just works and it makes sense. And that was also very satisfying. Yeah. It's funny. What you're saying about some of the typing and some of the stuff you said earlier about wanting things to run fast kind of makes me wonder what you think about Python in general, because we've actually had some very strong, different opinions from different ML researchers that we've talked to. So I'm curious to get yours as an author of a famous Python library. What do you think about the language? I mean, the thing is... Python of course has won at machine learning, however you want to call it. And I think that's surely, of course, because Python was there at the right time, in the right place. It was fast enough. It had support for C extensions, but it was also a general purpose language. Like, that's something I always like to point out. The reason Python is popular and works, and works well also for the type of stuff we're doing, is that you can do all kinds of things in Python. Python was really big for web development stuff before the machine learning thing started at scale. And that also means that it's a general purpose language you can learn and get into from even something else you're doing. And that's why an AI language has also never really taken off. It doesn't work. You want a general purpose language to write it. And I think that's why Python is so popular. And that's also why I like Python. And yeah, sure. You have to put some work into it to make it fast. In fact, spaCy is written in Cython, which is kind of this dialect of Python that lets you write C in the Python syntax. spaCy's a bit known for like, ooh, Cython. For some reason Matt has become kind of known for like, ooh, writing stuff in Cython, which some people can still find a bit intimidating, but I don't know. How did you feel about it? Because Sofie, you learned Cython. Yeah, I think you're making it sound more scary than it is, Ines. Oh, really.
Because not the whole of spaCy is implemented in Cython, obviously, it's just the parts that really matter efficiency-wise. So yeah. I think it's a very interesting question, because I, myself, I actually come from a Java background, which is obviously quite different. So, I think personally, I'm really happy with the combination between Python and the typing system, because you get a bit of the goods of both worlds. You have Python which, let's face it, just programs much more easily than Java, and there's just so much less overhead and so on. It's definitely grown on me. And I think the typing, I do really like it. Because especially if you're writing your own machine learning models, or Thinc, our open source library for machine learning, which has all of these types also integrated, so that if you're trying to combine layers that just don't have the right input and output types, it will tell you. And it won't just try to propagate these meaningless arrays that have the wrong dimensions, and then just crash somewhere in between; it will tell you upfront. So I think that that really helps and the type system really works there. Yeah, and I think there's a lot of exciting stuff happening in the ecosystem. It's still quite young. It's also the static type checking, mypy, that's all under very active development, just like Python, itself, really. But it's very cool to see some of this stuff actually work. Or using a modern editor and just seeing something underlined, and you look at it and you're like, ""Oh yeah, I passed something wrong to this function."" That could have easily taken me a long time, because I passed a string and it should've been a list, and the string is also iterable. Oh, yeah. Those sorts of bugs that everyone can relate to, and that's pretty cool. And also the ability to hook into that system, and into the static type checking via mypy and implement your own plugins for your own libraries and use cases. I think that's something we're going to see a lot more of in the next couple of years as the ecosystem around this matures. Yeah. Do you see demand for other languages? As in programming languages or using- Yeah, do people ever ask you, hey could you support Java? Well, I mean, it's still a Python library. But that said, there's a very popular wrapper written in R, and that's still a very popular language. If anything, I would say actually, in our space, that would be the other main language that people are working in. And sure, I might be biased because I don't know what people who are working in Java are doing, because they're surely not using Python. So, I mean, I don't know. It's like, ""Oh, we never hear from people who work in Java. Yeah, because they don't use our stuff."" And that's fine, fair enough. But yeah, I think it's also because R and Python integrate quite well. So it's basically just this wrapper layer, and it fits in with a lot of people who are working in digital humanities, social sciences. Actually, it's quite heavily R based, but they also have tons of texts to analyze. So, they often then use spaCy via R. Yeah. Got it. I guess, are there other things? I mean, one thing that I think of with your library and configuration systems is reproducibility effort. And especially, I imagine working with this range of people, especially in academia, but I think in industry too, it's so important to have things reproducible. Is there anything else you're doing to move things towards making things clear and reproducible?
Yeah, and I think that's definitely one of the main reasons for having the config like it is. So that basically everything is defined there, and also you can just set the seed of your random generator, so that it should pretty much be able to reproduce exactly the same weights, even in the machine learning models and so on. So that is definitely something that we care deeply about, as well. Yeah. And of course, another part is that, well, it's not just the model you're training. There are always these different components. You have the data you're downloading and loading in. You have some other script that you just need to pre-process something, and so on. And so another feature of spaCy three will be what we call spaCy projects or project templates. So it's a CLI interface that lets you work with more end-to-end workflows. Because often, yeah, you don't just run one command, you run a pre-processing step, you download something, then you train, then you want to evaluate. Sometimes you only want to rerun the training if your data changed, or if something else changed, or if your results changed. So there are all these interdependencies, and that's something we felt like was quite difficult to do fast internally, as well. So, that's what motivated this idea. You kind of think of it a bit like CI config, if you've ever configured something like pipelines or Travis. I mean, if you haven't, well, I guess you're lucky you never had to wrangle with CI things. Maybe it is one of these things where I'm like, ""Oh, I know way too much about these things, yeah."" Yeah, you just basically, in any system actually, you define a series of steps. You can have a file to do that. You can download data or anything else, any weights you need. And then you can upload that to a GitHub repo. We're going to provide lots of templates you can clone, and then that also makes it very easy for people to get started. Or even something as basic as a benchmark. We're currently running benchmarks, of course, because we need to test all the stuff we've built there, and we don't want to launch without having some numbers. But that makes it very nice, because we have the steps defined, we have the data defined, that's loaded. We have the processing script defined. We have everything down to the random seed set, so anyone running that should be able to reproduce that. And so, if you say, ""Hey, cool, I would love to run these benchmarks,"" you can do spaCy project clone, benchmark, whatever, it downloads it, then you run the assets, you run a named workflow, and then it just runs it. And then you can rerun a step. It will only rerun if actually things changed, and that system also makes it very easy to integrate with other libraries. Like for example, oh, you can have a script that does something very, very specific you want to do with Weights & Biases, that you wrote your custom function for, that integrates with the config. Or we have one project that shows how you can easily have one step that serves your model, using FastAPI, which is probably one of the most popular tools. And also the developer happens to be on our team. So anyway, people are always using spaCy and FastAPI, and people have always asked about integrations. And we were like, ""Well, it just works."" And that's actually something you'll be seeing in that integration. ""Well, is it even an integration? Because it just works."" Or a Streamlit visualizer, that's also pretty cool.
I imagine you run your steps, train, you have your output, your artifact, and then you just run visualize, and then it spins you up this app. Plus tons of... I don't know, I don't even know what people are going to build with it. And I'm very excited. So, that's also, yeah. That's super cool. I want to make sure I have a little bit of time to ask you questions about prodigy and data, because that is my former life. I also worked in the space. So, I'm also super passionate about people getting good training data. I'm curious if you could tell me a little bit about how Prodigy works. Maybe, does it integrate with spaCy in a special way? So yeah, obviously because we developed spaCy, so for the NLP capabilities and NLP workflows, we obviously had lots of opinions and ideas. So, they're lots of workflows that use spaCy. But basically part of the idea started when we started working on models and things and we wanted to create our own data. And that was also at a time when we realized, look, you don't need big, big data anymore. You don't need billions of labeled examples. You can do that, but often what you need is something very, very specific to what you're doing. And you want to create that, and you want to create a lot of it. And you don't want to have a meeting, and outsource it, and then get it back, and you're like, ""Why is this so shit?"" Then it's like, ""Well, yeah, because you just ask someone to label all persons and didn't tell them what you wanted."" And so, why are you surprised? And you paid them $2 an hour. ""Surprise,"" that the core of your application is kind of shit, if that's how you treat the data collection process. So anyway, that's something we've also seen a lot. And actually, very early on when we started the company, we did a bit of consulting for about six months to, we call it, we raced the client round. So, to get some money in, and also to see what people are doing in practice. And data collection always came up, every project. And also, it showed iterating on this was very difficult, because how people did it in a spreadsheet. And then often we're like, ""Oh, maybe you should try with that type of label scheme. Maybe you'd want to change this around a bit. Maybe predicting something else is actually more useful."" That's something, actually, to go back to our industry versus research thing. That's another thing people often forget. If you're not in research, you're in industry. You can choose how easy you make your problem. You can't do that if you're researching stuff. You can't be like, ""Oh, that task sucks. I'll just do a different task."" But if you are solving a problem, you can choose how easy or hard you're making it. And that often needs trial and error and you need to try things out. So that's how Prodigy was born, because we were like, ""We want a development tool."" Labeling data needs to be part of the development process. At least initially, before you scale it up, you want to be building these workflows. And you want to write them and place it. You want to load a model. Maybe you have a model, you want to have the model present you with what the model thinks is most likely correct. And then you can say, ""Yep, no, yes, no."" That's very fast. Or maybe you want to correct what a model does. Maybe you want to do something from scratch. Maybe you want to label entities, text categories, lots of things. So, that's what really motivated the tool. So in practice, it looks like this. It's a Python library you install alongside your other stack. 
You can use it to start up the web server via the CLI. And then you have a modern web app with an interface, that really focuses on one task at a time, doing it as efficiently as possible. Can move or label some data. If you're in a good flow, you can do, I don't know, a few seconds per example. So, it's actually really viable to say, ""Hey, you spent an hour and created a data set of a few hundred annotations."" And nowadays, that gives you enough to at least validate your hypothesis. You'll have some idea, ""Hey, how about we predict this as text classification task."" And then you're like, ""Is it going to work?"" Who knows? You have to try it. I mean, that's machine learning. I don't know. So, yeah. Yeah, and I think the other interesting thing is also that because we're targeting developers in NLP, I've spoken to quite a few people who are using spaCy in industry. And what is interesting is when they go and just annotate a little bit of the data themselves in Prodigy, right? They can script their own recipe, and they can annotate a bit of data, because that also helps you understand the domain better. And that's definitely going to help you model the challenge better. So, it's really this fast iteration of how could I annotate a data? What would make sense in my machine learning problem? And basically, knowing a little bit of the both worlds, rather than indeed just having thrown some data over the fence, and then trying to make sense of that from machine learning experts, which just doesn't doesn't work. So, yeah. And I feel like another difference, that you see, I guess, in real world, especially in NLP write is this loop of model gets trained, a little more data gets collected in particular ways, and modeling gets trained. Because how do your tools support that kind of process? I'm sure you've thought a lot about that. Yeah. So, definitely there's continuous... First just making that point to people was very important to say, ""Now your model isn't trained and done. Your model's never done."" You need to plan for a continuous process of improving it. And ideally, you also want the model in the loop somewhere, at some point at least, because you want to see, ""Am I actually producing stuff that's reasonable?"" And you can do that in different ways. You can actually have the model present it's suggestions, and you can annotate them, and give feedback, and evaluate your model that way. Also, one workflow we thought of is, well, what if you actually just focus on the basic uncertainty sampling? Even something very simple where you say, ""Hey, let's just see which text categories have scores closest to 0.5."" Because that means that it could go either way, and no matter how you correct it, you always have a gradient to update, and you have the biggest gradient to update with in either direction, because you're in the middle, right? So there's always something to learn from. And that's another approach you could take. And also, just allowing people to quickly spin up these experiments, and not having every update you make to your model be a whole bureaucratic process, because that's also often what it ends up. Developers want to develop. You don't want to have five meetings before you can start on your model. You want to just write code. And that's something people definitely appreciate, that great, I can get to work. Yeah. I don't have to schedule meetings. It makes sense. Yeah. 
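A minimal sketch of the "scores closest to 0.5" idea described above, kept independent of Prodigy's actual internals; the sorting function name, the classifier scores, and the example texts are all placeholders invented for illustration.

```python
def sort_by_uncertainty(scored_examples):
    """Sort (text, score) pairs so the most uncertain ones come first.

    For a binary text classifier, a score near 0.5 means the model could
    go either way, so correcting that example yields the most signal.
    """
    return sorted(scored_examples, key=lambda pair: abs(pair[1] - 0.5))

scored = [
    ("Company A acquires Company B", 0.93),
    ("Weather was nice this weekend", 0.07),
    ("Firm announces strategic partnership", 0.48),
    ("New office opens downtown", 0.61),
]

# Annotate the most uncertain examples first.
for text, score in sort_by_uncertainty(scored):
    print(f"{score:.2f}  {text}")
```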
So, we always end with two open-ended questions, and I want to make sure I give you both a chance to answer these. So the first question is, when you look at all of machine learning, including production stuff, is there a topic that comes to mind that people should be thinking about more than they are? I think for me, personally, and I think everybody who knows me in the NLP domain knows this, it's probably normalization or entity linking. And this is also one of the models that I worked on for spaCy. So basically, if you have a text and you've already been able to annotate something as being a person, or a location, or an organization, or whatnot, that's fine and that's interesting, but you also want to know what exactly it is. So, being able to give it some unique identity, preferably from a database or knowledge source somewhere. And for me, this is really a crucial step in NLP, because a lot of the other steps are just based on the text itself, and this links your text-based analysis to the outside world and an external knowledge base that you can then use to integrate your textual knowledge with other information from databases. Because for instance, if I think back about my BioNLP background, there are a lot of interactions — protein interactions — known in databases. People record them as structured information. And then there's another set of interactions that are only written in articles. And you would be amazed at how little the overlap is between the two. So you really want to be able to integrate them and combine both sources of information. And so for me, entity linking or normalization is... it's a very difficult challenge and we've definitely not solved it in spaCy yet, or it hasn't been solved in general yet, but I think this is an extremely interesting and crucial step in NLP. Yeah, and it's definitely also the type of task that we want to make easier for people to do. Okay, is it my turn? It might sound a bit basic, but the idea of really just sitting down and reasoning about what the fuck you're doing. Even if it's just reasoning — and I'm not saying people are not thinking — but often it can be quite refreshing... Often I've seen people over-complicating things or feeling very intimidated by all this machine learning stuff. And it's like, okay, what are you trying to solve? What are you trying to do? What makes sense? What can a computer do? What can a computer not do? What's logical and there's... What's so funny about reasoning? I don't know, it's just... I love it. Well, it's clearly not talked about enough. Sitting down and fucking reasoning. We all could stand to do more of that, I think. It's a great slogan. Not just in NLP. Yeah, actually just in life. Good life advice, yeah. Just think about stuff. But I do think some of that really also defines how we do things and how we think about running the company. I don't know, often people spend way too much time just looking at data and trying to infer stuff that makes no sense and that could be much better solved if you just sit down and think about it and are like, what makes sense? Should I do this? And not, ""I collected some data and it says I should do this."" Well, but this could mean all kinds of things. Is it logical? No. Okay. Then don't. Awesome. And I think then once you're there, it can be refreshing, as I said, because you're like oh, suddenly things make sense. 
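To illustrate the entity-linking step described above, here is a toy, lookup-based sketch. The knowledge-base identifiers and aliases are invented for illustration, and this is not spaCy's EntityLinker; a real linker would rank candidates using the surrounding context rather than just returning a list.

```python
# Toy normalization / entity-linking sketch: once a span has been recognized
# as an entity, map the mention text to candidate ids in an external knowledge
# base. All ids and entries here are made up for illustration.
KNOWLEDGE_BASE = {
    "CITY:paris_fr": {"name": "Paris", "type": "city"},
    "PERSON:paris_hilton": {"name": "Paris Hilton", "type": "person"},
}

ALIASES = {
    # lowercased mention text -> candidate KB ids
    "paris": ["CITY:paris_fr", "PERSON:paris_hilton"],
    "paris hilton": ["PERSON:paris_hilton"],
}

def link_entity(mention):
    """Return candidate KB ids for a mention; a real linker ranks them by context."""
    return ALIASES.get(mention.lower(), [])

print(link_entity("Paris"))         # two candidates -> still needs disambiguation
print(link_entity("Paris Hilton"))  # one candidate -> unambiguous
```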
Suddenly things are doable. Suddenly your problems are solvable because you're not stuck in like some weird technical rut, but you're actually thinking about what you should be doing. Is there a specific story that you're thinking of? Is there like a client interaction that you want to share with us? We did have people in the past who were like, ""Well, I want my system to be 90% accurate,"" and then we're like, ""On what?"" And they're like, ""No, no, 90% accurate."" And it's like what you doing with your system will decide about how successful your system is and that's what you want to measure and that's what you want to focus on. And I can see how this sometimes gets lost if you're not thinking about it and if all you follow is research, which has a slightly different objective because you're comparing algorithms. Yeah. Do you think you want to... I don't know. I always feel like with the stuff, when you're thinking about a big picture, it's easier to think clearly. And then you push it down to a sub problem like, ""Okay, I'm trying to optimize accuracy here."" And it's really easy to get lost as like an individual human, just trying to optimize accuracy, but then in an organization where you can't just run thought experiments in your brain, you have to actually talk to people. I think it's even easier for people to go down these optimization, micro-optimization paths, and very hard for people to pull themselves out too. Yeah. And it is also the fun part. This pyramid where at the bottom is data and then at some point you have the code and then you have hyper-parameters so people are like, ""Ooh, I can't wait to tune my hyper-parameters."" Right. Although the new spaCy library does support tuning hyper-parameters very well, doesn't it? Yeah, we do expose them, but we also have to say that we actually... That's another thing of optimizing for more stable industry use cases that hyper-parameters have never mattered that much for the models we implemented and still... I mean, now they do a bit more with the transformers that are just more sensitive to that. But we've always tried to also design things in a way that they're not so dependent on these really brittle random numbers, you set there, where it's like, Oh 0.01. Yeah, no wonder it's not working. It should have been like 0.001. That's not productive. That's something that shouldn't have to exist. I mean, that is still a lot of fun though, Ines. Yeah, it's just like... I love playing with my 0.001. Yeah. And also all this common wisdom of like, ""What should I use for the dropout?"" ""0.2?"" Why? ""I don't know, it works."" But I think it's good to also talk about this, you can see now, from the outside, if people are looking at the field. A lot of that is genuinely complex and abstract and difficult, but there're also a lot of the things that are not as deep as they might seem. We don't know all the answers and sometimes we just changing a number until something comes out that we like. When you look at your experience in your clients and your experience on your own of taking things from here's the thing I want, I'm starting to build a model too, okay, it's deployed in production and like helping someone or helping some process. Where are the unexpected places that these things get hard? I think people sort of know, maybe it needs to do little hyper-parameter tuning, but you're saying maybe that's not as important as we thought. 
And I think when you actually look at what ML practitioners do day to day, it's not all training models, so- Cleaning data. ... what else is there? I mean, what did you all see as the issues? I think cleaning data, OCR. Yeah, it's nice if you have everything in actual plain text, but often you have a PDF that someone scanned 10 years ago. Just keeping things together, because software is just hard — there are all these moving parts. There are all these moving parts in the ecosystem that all depend on each other. There's all the DevOps and infrastructure stuff, that's never been something that I was particularly into. I don't think you've ever been into that either. I think our team is strong... All that shit where you have to wait forever to see something fail. You run something and then you wait and then half an hour later you've seen it fail and then you try something else and then it runs again. And then two days later you're like still debugging. Yeah. One of the challenges to me is what I wasn't expecting when I started because we follow up on the issue tracker for spaCy, and then we look at the kind of problems that people run into. So what I wasn't expecting is that sometimes people just try to solve... for instance, a named entity recognition task with a text categorization pipeline. And often you can actually cast different NLP problems in different ways so that you can solve them in different fashions. And I think it's sometimes difficult to communicate what is the ideal way of going forward or trying to explain why you shouldn't use this, or maybe you want a rule-based system for some cases, and you don't need all of this ML training. And I think that's one of the challenges also a little bit for spaCy because you have a lot of possibilities and opportunities. There's rule-based components, there's machine learning training, a lot of it in there as well, but you need to know how to use the right tool for your specific challenge. And we can never know what the exact challenge is for all the users. So this is, I think, very difficult to guide people with as well. Actually, that ties back in with the reasoning about stuff. One example we sometimes show in talks is imagine you were trying to extract stuff from police reports about the victim, where the crime happened and... I don't know some other details around it. There are lots of ways you can model that. And one would be you do it end to end, you label a name as victim, and then you label something else as crime location. That's quite the obvious way. Maybe that works. Maybe if you have a big enough language model, you can actually learn that. But often this doesn't really work, necessarily. So, then you have to think about how else can I decompose this problem? Maybe I should just predict whether a text is about crime. Then I can predict the locations in it. Then I can use other information I have about the text to resolve that, figure out that's a location. That's the crime location. Maybe that's where a parser could come in handy because you can look at the syntax, especially in a language like English, there are only so many ways you can express a thing. There are only so many verbs that are used. If you cover the most common verbs, you've covered 95% of all constructions that are likely going to occur. Maybe it turns out it always mis-recognizes some city because the model wasn't trained on data that had many mentions of it. Yeah, you could retrain it. 
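As a rough sketch of the decomposition just described, the snippet below first asks whether a text is about a crime at all and only then extracts location entities as candidate crime locations. It assumes spaCy and its small English model are installed (`pip install spacy` and `python -m spacy download en_core_web_sm`), and the keyword check is only a stand-in for a trained text classifier.

```python
# Sketch of decomposing "find the crime location" instead of learning it end to end:
# 1) decide whether the text is about a crime, 2) pull out location entities.
import spacy

nlp = spacy.load("en_core_web_sm")
CRIME_WORDS = {"robbery", "assault", "burglary", "theft", "attacked"}

def about_crime(text):
    # Placeholder for a real text classification model.
    return any(word in text.lower() for word in CRIME_WORDS)

def candidate_crime_locations(text):
    if not about_crime(text):
        return []
    doc = nlp(text)
    # GPE = countries/cities/states, LOC = other locations in spaCy's English models.
    return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

print(candidate_crime_locations("The robbery took place in Berlin on Friday evening."))
```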
Maybe you would just want to put one regular expression in that makes sure that this thing is always recognized because you know the answer. So there are many ways you can go wrong. And I guess also just people still like this idea of downloading something off the internet and it just magically working for whatever complex, specific thing they think of. With a lot of these language models, of course, you only want to train them once, download them and then fine tune them or reuse them. But you should always want to train a model. The question is not, do I have to train a model? It's you can train a model now. It's great. It works. You're going to make your life so much easier. You should want to train a model and not... I mean, if there's something you want to predict, if not, you probably shouldn't. We often tell people, ""Look, you probably don't want to be using machine learning for this."" And they're like, what? Oh, someone actually someone did ask me once, they wanted to implement NER ah, for digits. And I was like, wait, just sequences of numbers in text. I'm like, ""Why do you want to predict that? You can match that with a regular expression."" He's like, ""Yeah, But my boss wants me to use machine learning."" Wow. I'm like, God, dude, I'm sorry. But yeah, that stuff like that definitely happens as well, people trying to model things that don't need to be modeled. It's funny though, I think 15 years ago when I was starting a labeling company, I felt like people were sort of thinking of machine learning as like the scary science project that they didn't want to do. And now it's like they want to add machine learning to ridiculous, easy rule-based tasks. It's so funny the way things change. Yeah, but I guess it's what people get paid for. I mean, there are some rare cases where I'm like God, how... some people who express their unfriendly attitudes on the internet will have jobs where they're likely to get paid a ton to do machine learning and hassling us about pretty basic stuff. I can see how everyone wants to work in machine learning because it's probably a nice job, but it doesn't always mean that what you're doing there is particularly good. Well, cool. Thanks so much for the time and we'll put a link to the new spaCy library in the show notes and maybe we can put in some tutorials to help people get started if they want to give it a try. Yeah, cool. Train a digit recognizer with good accuracy. Don't forget the hyper-parameter tuning. Yeah, don't forget was a wide hyper-parameters search. Exactly. Yeah. If you're lucky, you can get to 95%, I would say pretty good on any machine learning Task. Awesome. Well, thanks so much. Thanks for having us. Thanks. When we first started making these videos, we didn't know if anyone would be interested or, or want to see them, but we made them for fun. And we started off by making videos that would teach people. And now we get these great interviews with real industry practitioners and I love making this available to the whole world so everyone can watch these things for free. The more feedback you give us, the better stuff we can produce. So please subscribe, leave a comment, engage with us. We really appreciate it.",10584 +Daeil Kim — The Unreasonable Effectiveness of Synthetic Data,https://www.youtube.com/watch?v=QJ6DgjxFxmg,2230,2020-10-16,"The hard part is diversifying the content. So if we just have the same character in an environment doing everything, it's not going to work, right? 
So how do you actually create hundreds or thousands of variations of that character model with different behavior and things like that? That's been really the core focus of how we're thinking about our technology. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Daeil Kim is the co-founder and CEO of AI.Reverie. A startup that specializes in creating high quality synthetic training data for computer vision algorithms. Before that he was a senior data scientist at the New York Times. And before that he got his PhD in computer science from Brown university, focusing on machine learning and Bayesian statistics. He's going to talk about tools that will advance machine learning progress, and he's going to talk about synthetic data. I'm super excited for this. I was looking at your LinkedIn and you have a little bit of an unusual path, right? You did a liberal arts undergrad. Yeah, that's right. Can you say a little bit about... I feel like I come across people quite a lot that want to make career transitions into machine learning and related fields. What was that for you? What prompted you to do it? That's a great question. Wow. Searching back. I studied literature in college, so I actually did not have a lot of computer science background, and I've taken a lot of twists and turns in my life. Sarah Lawrence College is a pretty unique educational system. It's really small class sizes, Socratic system, liberal arts, humanities. I think from that, I garnered just the curiosity about the world. And then afterwards I did a lot of research in schizophrenia, so I studied mental illness. I was taking people inside MRI scanners and then analyzing their brain data. I spent about four years doing that after college. It's again, transition like, after college, no skills working at a wine shop, and then over time, volunteering at a lab and getting into that position where I started actually publishing papers and really getting into computational neuroscience. And then I wanted to be a doctor at some point, but then decided at the last minute to study machine learning because I was actually really interested in understanding the underlying fundamental aspects of intelligence. What does that mean? How can you actually model things like that? So instead of going to medical school, I decided to just do a PhD in computer science. After that, I wanted to try journalism, trying to see if I can apply and build tools to help journalism. So I worked in New York Times for a few years. And then finally, I was like, ""Okay, I really want to do this stuff, synthetic data."" So it's a lot of twists and turns, I have to say. I would have never been able to tell you this is where I would have ended up 10 years ago. So it's been quite a journey. That's so cool. That's an impressive skill to be able to completely switch fields like that. I think I'd be too afraid maybe to make a leap. Yeah. I think it's not easy. Let me just be clear. It's not easy learning the math, for example, with machine learning. Yeah. When did you learn the math? Because I feel like that's a place where a lot of people feel nervous. Did you take math as an undergrad? Or how did that happen? Not really. I didn't actually take a single math course in undergrad. So I had to learn. Actually, my PhD was Bayesian nonparametrics, which really gets into pretty complicated math, and variational calculus and things like that. 
Basically, I suffered [laughs] and I spent a lot of hours just learning. I took some classes as I could during the schizophrenia research, during that aspect of my life. I had to learn some level of statistics in that, and probability to be able to analyze that data. But then once you get into the machine learning stuff, and especially in that area I was in, you really needed to up your game. And then that's where I spent a lot of time trying to play catch up. You know, I learned a lot and it was an unbelievably fruitful experience I would say. So it's very rewarding. Do you have any tips for people trying to learn math outside of an undergrad curriculum? I think actually one of the best ways for me was actually appreciating the beauty of math. A lot of people are scared of math and thinking, ""Oh my God, I have to learn these rules, and first, second derivatives. I have to memorize these things."" But once you get into more of the theoretical stuff and you start thinking about basically like... I'm not sure if you've heard of these insane Millennium Problems — P equals NP, or the prime numbers, stuff like that, the Riemann hypothesis. There's so much beauty there and you can actually read about it and understand how challenging some of these problems have been and how profound they are. Being able to appreciate it from an aesthetic level, I think, helped give me the patience I needed to learn it a little bit more. You need to be patient. Your brain is not going to just pick this stuff up, if you've never been exposed to it, unless you're a lot smarter than I am, which might be the case. Yeah. Sounds unlikely. I don't know. For a long time, I've had a real interest in synthetic data, which is what your company does, but how did you get interested in synthetic data while working on journalism? So my adviser at Brown was actually a computer vision person. So I got exposed to a lot of the problems there. So you go to these conferences and it's always the same data sets being used. One thing that I actually wanted to do is one day build my own video game. I wanted to be able to actually create worlds. I wanted to see if you can integrate machine learning. That was an early interest of mine as I was learning this stuff. I've always believed simulation was such a powerful tool for a lot of things. So at some point at the New York Times, I had a really great experience there learning all sorts of things, amazing community of people. And then from there I really wanted to do this thing I've been dreaming of doing, and I knew that there was such a huge issue. The way I sometimes look at how science advances — I think it's actually through tools. I mean, you're building a great one with WandB. If you think about the microscope, for example, right? Before that, who... You know, there are entire fields that opened up. So what I'm hoping to do is I'm hoping to figure out a way to create a sort of a simulation platform that can one day be used by a lot of people. And at some point just introduce new people to ideas about how you can train computer vision algorithms without the standard process of collecting data in the real world and where simulation can actually play a really useful role. I think that really excited me and I actually think that there could be a lot of really important advancements in the acceleration of computer vision with the adoption of synthetic data. I see. So you'd actually been thinking about synthetic data for quite a long time. 
I should say, I don't know if you know about my previous company, we were talking about is CrowdFlower and became Figure Eight, and we developed data collection. I think it's funny. I think it was sort of similar experience to you of actually looking at conference papers and realizing they're all built around the same data set almost always based on the data sets that were available, which feels totally backwards. Right? You know? Yeah, yeah. Yeah. I think especially as a just starting out grad student researcher, you're the one that ends up spending a lot of time with the data sets. So you realize how massive they are and idiosyncratic they are. Absolutely. I would just also add that a lot of my work during my PhD was in Bayesian models. So there you have this notion of prior belief, you then estimate your posterior from that. But in deep learning, it's not that easy to establish a prior in a way that you can really control. I actually think synthetic data, at least for computer vision, the data itself can actually act as a really interesting prior. So there's connections there that I think I took from my own work of wanting to think about how to incorporate that. Simulation is one aspect of using data to generate that prior. Well, we always want to make this show for people that work in machine learning, but aren't necessarily domain experts in every [area]. Sure, sure. Maybe you could explain what synthetic data is and- Yeah. Absolutely. ... your take in how your system works today and then how you imagine it working in the future. Yeah. So the way we're talking about synthetic data is basically data that is generated from, let's say, a game engine or something that doesn't come from the real world. It's sort of artificially generated. Of course, people talk about synthetic data in NLP as well in generating fake texts or text that's relatively useful there. But for our purposes in our startup, we're primarily focused on computer vision. And so what we try to do is we try to create these very photorealistic, virtual worlds. We extract images from them. And then the nice thing about doing that in a simulated world is that you can encode some of the things you need for supervised learning in computer vision like the annotations and all that stuff directly. So you can help bypass that part of it and then help streamline that process. That's what we're focused on. We've been at it for close to four years now, and essentially we were trying to see where synthetic data works really well and how to push the boundaries there. Where does it work well? How real is it? Yeah. Great question. I like to think of what is a narrow problem and what is not so narrow, right? I say narrow AI all the time, and then... Think things like conveyor belts, right? Let's say you're processing certain types of food items, things like that. You're not going to see a random golden retriever jump on or things like that. The diversity of that scenario is not that large. I think their synthetic data really shines. That's one of those places. Of course, people are using it for really complex things like self-driving cars and things like that. But I would say if you want to think of a heuristic, the more narrow the problem, the more synthetic data will play a role. But of course, on the other end, there's attempts that we'll make to try to create that diversity. So the way we think about it as a company is how do you create diversity and how do you scale that? So we incorporate a lot of proceduralism in our world. 
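As a toy illustration of the two points above — randomized variation and annotations that come for free when you generate the scene yourself — the sketch below "renders" noisy images containing one bright rectangle and emits the exact bounding box for each. This is only the idea in miniature, not AI.Reverie's pipeline; it assumes numpy is available.

```python
# Toy synthetic-data generator: because we place the object ourselves, the
# ground-truth bounding box is known exactly, with no human annotation step.
import numpy as np

rng = np.random.default_rng(0)

def synth_example(size=64):
    img = rng.normal(0.1, 0.05, (size, size))   # noisy background
    w, h = rng.integers(8, 24, size=2)           # randomized object size
    x = rng.integers(0, size - w)
    y = rng.integers(0, size - h)
    img[y:y + h, x:x + w] += 0.8                 # the "object"
    bbox = (int(x), int(y), int(w), int(h))      # perfect, free annotation
    return img, bbox

images, boxes = zip(*(synth_example() for _ in range(1000)))
print(boxes[0])
```

A real system replaces the rectangle with procedurally varied 3D content, but the principle is the same: the renderer already knows every label it emits.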
We think about how to procedurally generate meshes, geometry, 3D models, things like that, how to automatically change the terrain, all that stuff. That's really a big focus of our work, and then understanding how you can quantify that gap between synthetic and real data through benchmarking of algorithms. That's where we use a lot of WandB as well to understand that. So how would it work? Say I'm trying to imagine what I might be doing where I would want to come to you. You can tell me — say I'm trying to do factory automation, since you said conveyor belt. I want to classify, does this machine look like it's in a normal state or a broken state? Right, right, right. So I'll give you an example that I can talk about a little bit. One problem is this company we're working with called Blue River. They're trying to solve this problem of being able to identify weeds in a crop field. It turns out that if you were able to target the herbicide you use, you can reduce the amount of herbicides by 95%. Right? Farmers are just spreading that all over. So what we've done on our end is that we've created an environment where we actually procedurally generate weeds with different vegetation stages, and all sorts of things like that. And then we're able to automatically annotate that via segmentation masks and then train an algorithm and show that we are getting X amount of improvement. Another example I can talk about is 7/11. We're working with them where we're actually creating a retail store with all these items, and they're interested in things like activity understanding, grasp and pose detection, things like that, grasp intention. We have our own motion capture studio, so we actually have a lot of really cool animations that we can generate from there. We create all that simulated data and then all of that has perfect ground-truth annotations and we feed it to them so they can basically download it and then use it to train their own algorithms. What's the point where... I mean, those are great examples that make total sense, but it also strikes me as kind of tricky to set up, to make it really realistic. What's the sort of scale that you need to be at for this type of approach to make sense? Yeah. It depends ultimately on the data set they're benchmarking against. Actually, when we work with companies, we often ask, can you share, at least for the evaluation, a real-world data set that we can benchmark against. So oftentimes the first iteration that we run and create this environment might get you a certain percentage, like 60%, of the real data baseline. Real data baseline being: if you were to train the same algorithm on the real data only, what is the thing you would get from that? And then we keep iterating and improving. We have ways of finding out where the gaps are in terms of the synthetic and real data. And then we have a whole team of procedural artists from the game industry that actually work to develop better ways of actually creating more diversity within those scenes. So it's not something that happens instantaneously, but it is something that once you build it, it's there forever. So you can just keep generating more and more data and iterating on that. So the early part of our company was just trying to create that infrastructure and then being able to have a streamlined process of iterating on that. The way I like to think about it is sort of a virtuous cycle. 
We generate the environment, we collect data, we benchmark it and then we iterate again and again and again, until we get to a point where we're happy with the synthetic data. But on the first time, it's usually never a ... You know. Unless it's a very simple, narrow problem, you usually don't get up to the same performance. And then depending on the problem, you'll look for different things in terms of what to improve. You might miss a certain type of orientation of an object, or you might have Zoom levels that are off that you didn't account for. Did the images that you generate look extremely realistic? Is that really important to making it work well? I think if I had to choose one, diversity is more important than photorealism. So I'm defining photorealism as the way people think about in computer graphics where you're modeling the light rays bouncing off of every part of an object and calculating that. That's how you get those CGI level realism. Because I do think that the technology that's coming out with the latest version of Unreal Engine and NTD is coming out with a global illumination system. That is just going to happen and GPUs are getting more powerful. So that level of realism is there. But the hard part is diversifying the content. So if we just have the same character in an environment doing everything, it's not going to work, right? So how do you actually create hundreds or thousands of variations of that character model with different behavior and things like that? That's been really the core focus of how we're thinking about our technology. I see. This must be really hard, but if I came to you and I was like, ""Hey, I want my accuracy to go,"" How would you even think about that? What kinds of performance gains do you predict? Well, let me answer that question in two ways. One way is there are scenarios where the only thing that could really work is synthetic data. Let's say you have a conveyor belt of, I don't know, ceramic mugs, and you need to also have an annotation around how much they weigh, right? You could potentially actually estimate that in a synthetic environment, because you might understand the materials and you can calculate that while it might be hard for a human annotator to look at that and be like, ""This is 37 grams, right?"" So there are scenarios where actually it can only seem to work with a ground truth thing. So there's an advantage there. In terms of performance, I think it really depends. I can give you off the top of my head for the narrow cases, you're essentially looking at 90.99, .98 mean average precision for things like that. When you're starting to talk about much more complex things, we released a paper called RarePlanes, where we actually released with Cosmiq Labs a huge satellite, synthetic satellite image with airplanes and all that stuff that's already been annotated and synthetic version of that. There, synthetic alone will give you 65 to 70% of the real baseline performance, but then what we do, and we would like to advocate for this. There's several things you can do on top of that. One is transfer learning. You can actually just take 10% of the real data. And then you start getting into the 95% of the performance of the real world data, just using 10% of that. And then you also have things like domain... Sorry? I just want to make sure I understand what you're saying. So you train on the synthetic data first and then you transfer to the non-synthetic data? Yeah. 
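A minimal PyTorch-style sketch of that recipe — pre-train on synthetic data, then fine-tune on a randomly sampled slice of the real training set, as elaborated below — might look like the following. The tiny tensor datasets and two-layer model are placeholders so the sketch runs on its own; they are not the actual system.

```python
# Sketch: pre-train on (cheap, plentiful) synthetic data, then fine-tune on a
# random ~10% sample of the real training set at a lower learning rate.
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def make_fake_set(n, noise):
    # Stand-in datasets: 64-dim "images", 2 classes. Replace with real loaders.
    x = torch.randn(n, 64) * noise
    y = (x.sum(dim=1) > 0).long()
    return TensorDataset(x, y)

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))

# 1. Pre-train on synthetic data only.
synthetic = DataLoader(make_fake_set(5000, noise=1.0), batch_size=64, shuffle=True)
train(model, synthetic, epochs=5, lr=1e-3)

# 2. Fine-tune on a random ~10% sample of the real training set, at a lower LR.
real = make_fake_set(1000, noise=1.3)
idx = torch.randperm(len(real))[: len(real) // 10]
real_10pct = DataLoader(Subset(real, idx.tolist()), batch_size=32, shuffle=True)
train(model, real_10pct, epochs=5, lr=1e-4)
```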
You take the real world, just 10% of the real-world data and you fine tune it off of that. So you can either pre-train it that way... What's that? It's a final step you fine tune it on? Yeah. I see. Exactly. And then you get much better performances. Of course, that 10% comes from the real world training set, not the test set. So the fine tuning stuff, I think the fine... The way I look at why that performance got us up to 95% is that I think you're feeding the sort of prior version of what the algorithm thinks the world should be. And then all the sort of noise that comes from the sensors and then any unique variations of that can be transferred in that fine tuning step and taking that fuzzy vision and then sharpening it with some real world data. You say 10% of the training data. Did you take the other 90% and use it in the initial model? No. We would just randomly sample 10% of the real world training data for the fine tuning stuff and then we'll first train it off just synthetic only. Right? So we train it off the synthetic data first. That gets us to something like 60 to 70%, at least in the airplane's scenario, which is a bit complex. And then when we take just 10% randomly sampled from the training data set and the real world images, then we get to the same 95% of if you were to train on a hundred percent of the real world data. Oh, I see. So you're saying it's 10 times as efficiently using the labeled data. Yeah. You're saving a little bit of 90% of the real world data needs. Yeah. And presumably if you used all the real world data, you'd make an even better model? We found that it actually tapers off a bit after 10%. After 10%, it tends to taper off. I mean, diminishing returns, which is what I'm saying. I see. But of course, they're still there. Again, this is one scenario. Different scenarios have different performances and it really actually depends on the data set you have at the end of the day. I think other people point this out all the time, not enough people focus on the data itself. So if your data set is really wonky, who knows what you can train off of it and who knows if the benchmarks even are useful there at that point. There's a lot of things to consider, but generally, we find that fine tuning helps. And then the other thing I wanted to bring up was domain adaptation, which is the set of algorithms and computer vision that tried to transfer to statistics from real-world images to synthetic images. So algorithms are like image to image translation where you can maybe think style transfer, take that from the real world images and try to incorporate that noise, interesting real world noise into the synthetic images themselves. Oh, interesting. Can you really see that in the image? Yeah. Yeah, actually. I mean, Nvidia has done some great work on this. So Nvidia has definitely done a lot of good work in domain adaptation. And these computer vision conferences, it's been a really active area of research. It's sometimes not that distinguishable. What you find is more texture differences versus shape differences. As you can imagine, those are probably the more difficult things to transfer, but it does help. It definitely does help for certain scenarios. Yeah. Interesting. I mean, I think the first thing you said when we were talking is you envision this as being a tool for people to use, but it sounds like maybe today you need to involve real artists. Right? I would assume that the interface isn't really a tool that I... You know, it wouldn't be like a TensorFlow. 
No, no, no, no, no, no. What's your plan to bridge that gap? Yeah. That's a great question. You can almost think of it like a download a little video game, right? At some point, if you build an awesome enough environment and then we're building out a whole UI and productizing that process. So you can imagine these virtual environments living on the cloud somewhere, and then you have an API that allows you to tweak certain things like lighting, time of day, how things are spawned, all of that stuff. And then you'll be able to collect your own data that way. Right? So it's not so much that we give people the ability to create their own 3D worlds, as much as we'll create it and give them access to this huge environment that allows them to collect as much data as they want, and to see how far we can push that. We're still building that out, but I'm hoping that once we start doing that and we set the paradigm for that, other people will follow and understand the value of that when it comes to computer vision. And yeah, so hopefully that's how. There are some really other cool ideas too, like stuff around medicine stuff where you're actually, if you can create an environment that has a lot of really interesting ways, you can modify it through API calls or some scripts. You can then imagine the reinforcement learning algorithm that can explore and exploit a whole range of parameters to figure out how to actually get the best synthetic data, right? Where the reward function is tied to, let's say, your mean average precision and things like that. I'm curious, like in your company, is it mostly artists making this stuff or is it mostly machine learning people or is it graphics people? What's the composition of- Yeah, it's a very interesting mix. One of the best things about working with all these folks is that they come from a wide range. We have procedural artists. We have your standard technical artists. So procedural artists will be able to do some amazing things with geometry and create all sorts of geometry procedurally. We have people who understand how to create procedural textures, materials on those 3D. So a lot of game developers. We have animation people. We have the motion capture. We have game engineers to be the glue that puts all this stuff together. And then we have a whole team of deep learning people who actually benchmark that data. Content is generated from one side of the company, gets fed to the ML people, and then they're like, ""Ah, it's great. No, it's good. Like, we need to do... You know, this is not working. This is working."" And then it goes back. So there's a constant conversation between these two groups of people where we're always trying to improve the data and understand what's missing. Did the ML people do any of the image generation now, like with GANs and other techniques? Have you started to do that or is it mostly more classical procedural generation? I'm not super familiar of the field, so I don't know how- Oh, yeah. We've tried some of the adversarial stuff. It's not easy to get GANs to optimize. There's a lot of issues of mode collapse and things like that. So the adversarial networks, you tend to use that more for domain adaptation. So you can imagine those techniques where you're trying to create no distinguishable difference between the synthetic and real data that those GANs can be very useful there. We are still actively doing some R and D on geometry creation. 
There's a really cool paper out called PolyGen from DeepMind that does some cool work on that space. But yeah. Right now, what we're trying to really focus on is trying to create a whole suite of procedural tools that are based off of tools like Houdini. I don't know if you've heard of Houdini before, but it's a way to create procedural geometry. And then we've done a lot of good work with that. At some point, we want to move more towards a system where we can just train off of our current library of 3D models, which is really large right now, and then be able to generate new models there. But I think it's still a little bit more R and D that's necessary to get that stable. That's my guess, but I don't know. Maybe somebody has an amazing algorithm out there that works all the time. So, yeah. So most of the models that you're actually building are vision models, it sounds like. Yeah, we're primarily focused on vision. The reason why is because as much as we'd love to do RL based things, vision is nice because it doesn't have to require the kind of physics necessarily in a game engine, which... You know. What we use is we use unreal game engine to build our platform. There, the physics isn't as accurate as you would need to get the right RL stuff working. So we're waiting until that becomes more mature before jumping into an RL. But right now computer vision is our primary focus. Interesting. You're waiting for a good physics engine to do RL. That's sounds like a real opportunity. Yeah, yeah, exactly. But once we can create these unbelievably rich realistic world, then incorporating the physics will be the next step. And then we'll get a cute little dog running around in the field, jumping over rocks. I'm curious. I've played a little bit with MuJoCo. What makes that not something that you could use for this kind of thing? I don't have as much familiarity with that platform as well, but what we love about the unreal game engine is that it is such a powerful suite of tools, and it is capable of a lot of stuff. A lot of huge scaled worlds, really, really being able to have high performance photorealism. Especially with the new unreal engine coming out, you're going to be able to get near photorealism in real time. So all that stuff is just... It really allows you to create rich worlds and some of the other platforms that I've seen really aren't built for that in the same way, in my opinion. I was looking for the most robust system to build all of this stuff to be able to... One of the cool things we do is we, for example, generate huge cities. So we'll take things like open street map, geospatial data, things like that. And then we'll generate a big part of Manhattan, for example. And that takes us a few days to just put it through a system and then out pops this fully virtual 3D world that you can walk around. So this is stuff that I think the unreal engine is quite well suited for. Switching gears slightly to the ML team because I think that's going to really resonate for people listening and watching this. You've now been building models for customers for four years, I guess, which is probably longer than, or at least building proxy models tweaking ML for enterprise and production. I wonder how have your processes and tools changed over the years that you've been doing it? Yeah. I just want to caveat this by saying that we don't try to actually create models that are going to be used by everybody in the world, or in production. 
We train models for the purpose of understanding how good our synthetic data is. Unfortunately, we're not spending all our time pushing the boundaries on the next version of transformer architectures. We're not as focused on that, for example; we're more focused on trying to understand. I actually think it's a different way to think about optimizing your model. Of course, you can optimize it through hyperparameter searches, messing around with learning rates, all sorts of things like that. But actually, the way we do it is that we'll try a few things here and there in terms of the hyperparameters, but we're really focused on what the data tells us. So we can quickly go back, and within a few days make considerable changes to the data that we have. And then that's almost how we think about tuning our model and improving it. So we're taking a data-first approach in terms of optimizing the performance of our vision models, and we do it for the benefit of the customer. We want to be able to show and prove to the customer that this data is valuable and useful. That will resonate with a lot of the people that I've talked to. I think most people in the real world tend to focus a lot on picking and choosing the data to make the models work well. How have your systems evolved for doing that? Obviously, it's Weights & Biases. That's how we got connected. But what other tools do you use? Yeah. Weights & Biases is awesome. We love it. We've been using it a lot for understanding how our models are performing, but for us, we have our own data centers. We have our own co-location system, and then we use something called Polyaxon to orchestrate all of the experiments. So let's say you want to run 20 or 30 experiments. We have a system like Polyaxon that orchestrates all that, but it's also tied to WandB. So we get all sorts of cool metrics to understand how the model is doing, and we can plot out a lot of stuff. We've also created our own customized dashboards to understand the difference between synthetic and real data. There are some really cool things you can do with some of the new transformer architectures that can generate visual attention maps to understand some of the differences between the synthetic and real data. But at the end of the day, a lot of it is around that part of just trying to get the synthetic data. It's all focused on: improve the synthetic data, improve the synthetic data. And then once we get it to a point, then we feel good. Then we can start doing crazy things with it like, ""Okay, this edge case that never happens in the real world. We'll create it."" Or this perspective all of a sudden is changed and you need a whole new dataset where the camera angle is now different because it's in a different place. Well, okay, we'll generate all that. There are things like that that we also do a lot of. So it's a relationship we have. So those are roughly the tools. We're not like a huge startup where we have 50 ML people, but it's a pretty nice pipeline. Also, our data is all API-driven. So with just literally a few lines of API code, we get the data we need, and then it's streamlined into this whole orchestration of experiments. And then once we get the performance of that, we have our Weights & Biases dashboards and all these nice visualizations to understand where the differences are. Do you use TensorFlow or PyTorch or something else, or all of them? We're PyTorch fans. Yeah, yeah, yeah, yeah. 
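A minimal sketch of the experiment-tracking side of that setup might look like the following. It assumes the `wandb` package is installed; the project name, config fields, and metric values are illustrative placeholders, not real results, and the orchestration around it (Polyaxon or otherwise) is left out.

```python
# Each run records its config and per-epoch metrics so synthetic-vs-real
# benchmarks can be compared later across many runs.
import wandb

run = wandb.init(
    project="synthetic-vs-real",                      # illustrative project name
    config={"data": "synthetic+10pct_real", "lr": 1e-4},
)
for epoch in range(5):
    # ... train one epoch, evaluate on the real held-out set ...
    val_map = 0.60 + 0.05 * epoch                     # placeholder metric value
    wandb.log({"epoch": epoch, "val_mAP": val_map})
run.finish()
```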
I mean, TensorFlow is great for production stuff, but Pytorch is just so nice in terms of debugging and so there's a lot of stuff... It's just also kind of a culture thing. You start off with Pytorch and then making the switch to TensorFlow is a little bit hard. But yeah, so most of our stuff is in Pytorch. You started with Pytorch like four years ago when you started. No, no? No, no, no, no, no. Well, keep in mind the first year is a little bit like- Yeah, sure. You know, swimming in the open ocean, trying to find the island. There's a little bit of that that happens. Right? And then of course Pytorch has matured a bit over the years. Definitely there was a little bit of just trying to get the other stuff working, but yeah. So, in the past year, year and a half, it's been Pytorch primarily. Cool. Well, thanks so much. This has been super interesting. We actually always end with two questions. Okay. Sure. The first one I'll tell you. So what is one underrated aspect of machine learning that you think people should pay more attention to than they do? I mean, given that I work in synthetic data, I'm a bit biased here in my response, but I really think that as much as time as you can spend on the architecture, sometimes it's really just the data that is an issue. So, I think people need to think more about the data and what that data looks like and understand what you're working with and the biases inherent in that data. That's a big thing.",6017 +Joaquin Candela — Definitions of Fairness,https://www.youtube.com/watch?v=KP5PhuwYahI,4757,2020-10-08,"There isn't one definition of fairness, right? If you look at philosophy, whether it's moral or political philosophy. Or you look at the law. Or even you look at the vibrant community in the computer science community and machine learning who is thinking about algorithmic bias. One common pattern is that you have multiple definitions of fairness that are mutually incompatible. So you have to pick. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host Lukas Biewald. Joaquin Candela is the tech lead for Responsible AI at Facebook. Prior to that, he built the applied machine learning team. Which powers all production applications of AI across all of Facebook's products. Before that, Joaquin taught at university of Cambridge and worked at Microsoft Research. Today, I'm going to talk to him about fairness and algorithmic bias, scaling and democratizing AI at Facebook. You were running the applied machine learning team at Facebook, right? During a time when there was tons of machine learning innovation going on. I'd love to hear what was happening when you started working on that. And kind of what tooling was necessary and how that kind of changed over the time that you were working on that. I think the context is very important here. So I joined Facebook in mid 2012. There wasn't a lot of ML people in the company. And if you think about the two biggest applications, it was News Feed ranking on the one hand. And then ads ranking, right? So two ranking problems. So as far as the models were concerned you mostly had binary classifiers. That were used as inputs into a ranking function, right? So if you think about news feed ranking, you would have my value function is some combination of, I give every click a certain score. I give every comment a score. I give every share a score, et cetera. And then we've got to build myself a value function. 
And so I have all these binary classifiers that predict the probability that someone will click, share or comment or whatever, before I show them something. And then I kind of use that to sort of rank content. And for ads it's a similar thing. Ranking ads — back in the prehistoric times, click-based advertising was the big thing. Maybe like... I don't even remember now. Like 15 years ago, 20 years ago, whenever. And then you know that you had conversions. And then just more subtle things where you have brand. And then not all conversions are created equal. And then the only thing that happens of course, is that the complexity of the content evolves. If you think about when I joined Facebook, a lot of the content was mostly text. Images of course were there, fewer videos. And now that sort of becomes more complex and you have more multimodality. So I joined Facebook at a time where the company had just IPO'd. And revenue was flat. And so there was a huge pressure to try and move the needle in ads. I joined the ads team. And one of the big levers to move revenue was like, ""Oh, can we get better at predicting clicks and conversions on ads."" But at the same time you start to have... We started to move away from only serving ads on web on the right-hand column, to actually also serving ads on mobile. And then actually end of 2012 when I joined, if you look at where people were accessing Facebook from, web was kind of slowly declining or being stable, and then mobile was rocketing. And I think they crossed sort of at around the end of 2012. The types of surfaces you have, the types of things you're predicting start to increase. And so the first dilemma that I had, was I looked at what were we doing? And we were using just a Soyuz, in a way. Like to go to the space station, nothing fancy, just like the good old Soyuz. We were using gradient boosted decision trees as feature transformers. Mostly you could think about it that way. And then we were using online logistic regression sort of after that. Cascaded with it. So- What would you... Sorry to interrupt, but what then would you train the intermediate gradient boosted tree on? What would be the kind of thing that that would try to predict? ... You'd still train them on the event. On the binary event that you're trying to predict like clicks or conversions or whatever. But obviously you'd benefit from the robustness that that gives you. You don't have to worry too much about scaling, and translation, and whatever of your features. But then you would feed them into a simpler model? What you would then use is the trees themselves. Every tree is a categorical feature, as it were. And so then your logistic regression model, which would be training online, has a bunch of inputs that are categorical, which are the outputs of the tree. So it's basically kind of relearning the weights associated to its leaves. Interesting. Wow! I had not heard of that. So the thing that's changing then is sort of the combination of the... You train a gradient boosted tree, then you pull the trees apart and then you relearn the weights of each tree in the combination? Yeah. It's a hack. It's not a fully backpropagated model. Because you train your trees every few weeks or whenever. And then you have logistic regression that takes as inputs both like the binary indicators. So every one of the trees you train — like hundreds, maybe a couple of thousand trees. So you have... And any one tree has a dozen leaves or whatever. 
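As a hedged sketch of the cascade described here (and of the one-out-of-N leaf encoding elaborated just below), a scikit-learn version might look like this. It is an illustration on toy data under stated assumptions — scikit-learn 1.1 or later — not Facebook's production system, and `SGDClassifier` merely approximates the online logistic regression.

```python
# Gradient-boosted trees as feature transformers: each example is encoded by
# the leaf it falls into in every tree (one-hot), and a logistic-regression-style
# model learns a weight per leaf, updated in streaming fashion.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# 1. Train the trees every so often (offline).
gbdt = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
gbdt.fit(X[:4000], y[:4000])

# 2. Each tree's leaf index becomes a categorical feature.
leaves = gbdt.apply(X)[:, :, 0]                 # shape: (n_samples, n_trees)
enc = OneHotEncoder(handle_unknown="ignore").fit(leaves[:4000])

# 3. A logistic-regression layer re-learns a weight per leaf, updated online.
lr = SGDClassifier(loss="log_loss")
for start in range(0, 4000, 500):               # stand-in for streaming updates
    batch = slice(start, start + 500)
    lr.partial_fit(enc.transform(leaves[batch]), y[batch], classes=[0, 1])

print("held-out accuracy:", lr.score(enc.transform(leaves[4000:]), y[4000:]))
```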
And you treat those as a one-out-of-12 kind of encoding. But then you're learning a weight for each of those. And you're kind of running in real time. And then you have other features that go in side by side. That can actually be sort of continuous features as well. That's the setup. That's what I found when I got there. And so the key decision — since you wanted to talk about building applied ML and all that — the dilemma was the following. It was like, ""Well, this thing is begging for a proper neural net to be thrown at it."" It's almost like we've handcrafted a Frankenstein type of neural net by having these trees, with logistic regression concatenated to it. But we're not even training it together. We first train the trees and we kind of chop the output. And then we kind of plug this other thing to it. And then that's the thing that we train. So it was already obvious that doing that would probably give us gains. This was- And this is actually... So in 2012 it was obvious that a neural net would give you an improvement? I'm trying to remember. ... Sort of. Was that obvious to everyone? No. Because if you think about the... I think it was Russ Salakhutdinov and Jeff Hinton. And I might be forgetting some co-author. So I deeply apologize because I've had a long day already. Apologies. But this was the big ImageNet paper. It was with convnets, I think from 2012 if I'm not mistaken. Mm-hmm (affirmative). That sounds right. Yeah. So I don't think it was clear. I think it was just the beginning of the hockey stick. But I think it wasn't clear. If that had been two years later, then it would have been obvious, right? Right. But at the time it wasn't clear yet. And you always need a couple of months to realize that something happened. So the thing that really struck me was the time it took from doing an experiment... Which a lot of it was really feature engineering. Maybe there were some experiments with tuning your learning rates. And tuning the architecture of your trees and the architecture of your logistic regression. Although there isn't a lot of architecture to be tuned with logistic regression. The time to go from someone has an idea and starts to do some offline experiments, to actually having your new click prediction or conversion prediction model for mobile or whatever in production, and actually materializing those gains, would be several weeks. It'd be sometimes six weeks, sometimes two months. And I thought, ""Holy crap, that's not great."" And so the crossroads in a way, on the one hand you're like, ""This thing I have is simple, but we're still getting a lot of gains by tuning it."" And on the other hand, I can go and just replace my Soyuz with something sophisticated. So that was the crossroads that I... So do you want to know what I decided to do? Tell me, yes. I feel like you're kind of picking on this Soyuz. I didn't know that was the metaphor for the tree thing. It's true. The Soyuz... Well, I think the Soyuz is rudimentary in the sense that the computer systems that the Soyuz has in it are probably 50 years old or something like that. But they work. So the reason I use the Soyuz analogy is more that it gets the job done. It's like a gradient boosted decision tree and logistic regression. As an aside, one thing that triggers me these days a little bit is I see people jump straight in. If they have to solve an NLP task, they'll use either some sort of a sequence model. They're using LSTMs. They'll use... 
What I mean is that it's a transformer or whatever. And sometimes you'll go like, ""Did you try like a maxent model?"" ""Did you try a good old bag of words with logistic regression?"" And the surprising thing is that I would say between 20 and 50% of the time you get the same results. And then you're like, ""Did you realize how much cheaper this thing is in terms of anything you care about?"" Whether it's training time, inference time, whatever. So basically the big bet there was to say, ""Well, what we need to do here is we need to actually allow our teams to ship every week."" And that was the big motto: ship every week. Do whatever it takes so that every week we can ship new models in production. And what that meant was we needed to dramatically accelerate that path from, ""I have a new model that I could put in production."" To like, ""It's in production."" And that kind of triggered the five years of work. And so what were the keys? I mean, tell me the pieces that you needed to build in order to allow that to happen. Because I'm sure a lot of people listening to this are thinking, ""I'd like to ship a model every week."" What do you need in order to do that safely? It was many things at different levels. So at a very low level, it's about fitting in seamlessly with whatever infrastructure you have for inference, and adopting some sort of standards, which seems super easy and trivial. But even that you shouldn't take for granted. The part that I thought was even more interesting is that I think what was slowing people down was probably two or three things. One was, it was extremely difficult to share work between people. Because people would be running experiments on their own dev servers. And even having... As we all know, configs back then weren't sort of easily portable. It would just take you a couple of hours or whatever. You'd have an energy barrier before I could actually play with what you had done. The second thing, which I think is related, is that you started to have a lot of teams reinventing the wheel. So a lot of the work that was being done was actually duplicate. Because the number of surfaces on which we showed ads sort of kept increasing. And the types of modalities kept increasing. You kind of had teams that focused on one of those voxels in your tensor of configurations. And they wouldn't sort of easily talk to each other. Or the work wouldn't be discoverable. So thing number one was: automate everything. You have to automate everything. You have to make it ridiculously easy and you have to abstract everything away from the engineer trying to deploy something. Especially because we're growing very fast and you get a lot of people who are joining the company fresh from somewhere. Maybe they are good applied researchers, but they're not infra people necessarily. So abstracting and automating, super important. The second: shareability. Make sure that you abstract and encapsulate things in a way where they're super easy to share. So I can see what input features are working for you. If you're working on conversion prediction models for in-game ads or whatever. I can super easily see that. Obviously you have infra work again. The way we store and represent data is very heterogeneous. So it's a pain in the butt usually to... Even if you're only looking at reproducing training, depending on what your setup is, that's work. But then going... Obviously the way you run your data pipelines when you're training offline. 
Versus when you're trying to serve in real time is different almost always. And obviously, when you're online, you're on a budget. So you want to make as few calls as possible when you're serving. SO you got to sort of figure out how to abstract those things. And again hide all the complexity. And then the third one, which I think is really interesting is really think about collaboration by design. How can you build an environment where I go in and I can see every single experiment anyone has run. And I can go and by clicking, I can see first of all who they are. Who they are is huge, right? Because then I know who to ask. Especially if I'm new to the company and you have a company that's growing fast. So the equivalent of your git blame or whatever is super important. You need to know who people are. The second one again is so much is wasted in terms of replicating experiments. That someone has already done. So bookkeeping is extremely important. And then the ability to just beg, borrow and steal bits of pieces. Either of feature competitions or models. We were exposing learning curves and things like that as well. So you can actually sort of browse them. And then another component... And I'm not being super organized here. I think I've said it's three things. And I'm at the fifth thing already. But another one is try to be as modular as possible. And if possible as well, language agnostic. And separate out the language or the platform. That you're using to kind of specify whatever models you're building from the definition of a workflow and execution of a workflow. So it's really abstracting that away. And sort of thinking about an ML workflow, is in an ML workflow. And I don't care if you're specifying your models in MATLAB, Octave, Python, PyTorch, TensorFlow, whatever it is that you're doing. A lot of the bread and butter that you're doing is kind of common. So really layer it, modularize it was sort of huge, It's interesting. I feel like the things that you're saying are the things that all ML leaders want that I talk to. But I think that the place that they get tripped up. I mean, all the benefits that you're saying totally makes sense. But I think the sort of downside is that it requires getting everyone's buy in into kind of a standard way of doing things. And I'm curious how you got that. Because ML practitioners are so hard to hire. And they're often opinionated in working different parts of the org. How did you get them all to buy into the same system? Often opinionated you say. I would like to meet one that is not opinionated. Sometimes not opinionated. It's tricky and this is actually... So you're putting the finger on an amazing point. Which is really a almost like change management. Very hard, but several reasons why it's hard. Reason number one, in any fast moving company where you have low hanging fruits... I mean, this is not unique to a ML. Who's going to actually pause and do and clean up the kitchen. And pay back some tech debt or build in first so you move faster. You're almost like, ""Hey, why don't you do it?"" Like, ""I don't feel like doing it myself."" So that's one challenge. The other one of course is a sense of pride that people have. I mean, and especially... I used to be in academia. And in academia, the thing that determines your worth is almost the praise you get for the work that you do. But you put your name on your papers. 
So culturally it's tricky to say, ""I'm going to surrender some of that for the greater good."" So the tactic that we took, one of them at least, was to be ridiculously laser-focused. The one thing that I should have clarified is I never dreamt that one day I would build the applied machine learning team at Facebook. That was not the intent. I was in ads, we were focused on ads. But even within ads, we already started to have several teams working on similar aspects of the problem. So at least we worked on generating alignment and a vision within that. And that was not like a million people. It was just a couple of dozen people. And we were all feeling the pain and the urgency to move fast. So it was semi-obvious that this was going to be good. It was a bold bet. So you need to generate alignment, both with the people who are deploying things and doing experiments every day, but also get management to give you air cover. Because things are going to slow down. I can remember talking to my manager at Philz Coffee at the end of 2012, when revenue was still not picking up. And he was asking me, ""Hey, you haven't been shipping models as often. What's going on?"" And I'm like, ""Well, actually we're going to slow down even more."" And he was like, ""Explain."" And then you explain, you get into the details. You get buy-in on the vision at all levels. But you keep it very narrow. And then what happened, once we started to have progress and stuff started to move faster and you saw productivity increase, then we started to talk to the feed ranking team. And with the feed ranking team we decided to join forces for summer 2013. And that was really interesting because there again, you have to just be laser-focused. Don't think about the features first. Don't think abstract first. Don't think about... It's not like platform first and then we see what happens. It's like, be extremely concrete. Like, here's the types of things I want to make work. And also just accept that one day you have to rewrite it, and that that's okay. But for now you want to prove the hero scenario. You want to prove, ""Hey, this can actually be amazing."" So that was the approach. It was extremely laser-focused. Start very small, start adding people, build almost a community that supports it. Really go from a core and then start expanding. It's interesting, at Weights and Biases we make a tool and I actually didn't realize how similar our tool's vision is to what you were building. Our hope is to really help with collaboration and reproducibility. And sort of the same idea of, we really want people to be able to find the person that made the thing, and not have to redo all the work from scratch. And I think we have maybe even more trouble than you getting buy-in. Because no one owes us anything. Why would someone want to use our tool? And I feel like for us, a big part of it is showing little wins to the individual practitioner. I feel like there's little details in our product where we try to just give something helpful right out of the gate to someone new coming in, before they do the collaboration and before they have to really buy into our system. I wonder if there's any things like that for you? Or people like, ""I want to be able to see the system metrics of my runs."" Or something like that, that got people to use your stuff. Yeah. Excellent question. I'm just mining you selfishly for features for a product really. So shameless.
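Said in code, the setup described at the top of this exchange (gradient-boosted trees trained first, each tree's leaf treated as a one-out-of-K feature, and a logistic regression learning a weight per leaf, with continuous features alongside) looks roughly like the scikit-learn sketch below; the data is synthetic and this is only an illustration of the idea, not the production system being discussed.

```python
# Rough sketch of trees + logistic regression: train the trees first, 'chop'
# the output by taking each example's leaf index per tree, one-hot encode those
# indices, and train a logistic regression on top, alongside continuous features.
import numpy as np
from scipy.sparse import hstack
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tree, X_lr, y_tree, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# Step 1: train the trees on their own (not jointly with the logistic regression).
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_tree, y_tree)

# Step 2: one-out-of-K encode the leaf index each example lands in, per tree.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(gbdt.apply(X_tree)[:, :, 0])
leaf_features = encoder.transform(gbdt.apply(X_lr)[:, :, 0])

# Step 3: logistic regression on top, with the continuous features concatenated side by side.
lr = LogisticRegression(max_iter=1000)
lr.fit(hstack([leaf_features, X_lr]), y_lr)
print('click probability for one example:',
      lr.predict_proba(hstack([leaf_features[:1], X_lr[:1]]))[0, 1])
```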
One caveat, Lukas, that I have to say of course, is that when I was very involved with this stuff in the early days, that was already eight, seven and six years ago. I know that things have changed a lot. I know that you have open source tools today, which if we had had them, we would have just used them directly. Including maybe Weights and Biases' products. So your question is, if you set aside the collaboration benefits and all that, just in terms of pure individual contributor productivity, why would I care? I have good news and I have bad news, I guess. The good news and the bad news. I think the bad news maybe is that some of the problems we solved were actually a bit Facebook-specific. So I think that's not going to be useful to you. But in terms of abstraction and just the ability to, almost at the click of a button... The fact that you could actually clone a workflow... I'm going to give you an example. Here's an example. So you're the Instagram ML team. And Instagram's never been ranked before. Instagram's feed has been shown in decreasing chronological order, from most recent to least recent. And your task is like, ""Hey, design me a ranking system for Instagram."" That's kind of a tall order. But imagine now that you have an environment where you can actually just look at the production system that ranks News Feed. There's a lot you can borrow there. And so I think just the ability to borrow, whether it's the features that seem to be working the best, the models, the sort of training schedule, the hyperparameters. All of that is a big thing. In parallel to that you have abstractions. Again, at Facebook I don't even know how many distinct and mutually incompatible data stores we have. But you can imagine. And the fact that the tool will actually abstract that for you is very useful as well. Then if you have to build a workflow yourself... Building workflows is a pain. If you have to do them from... If you don't have a tool to build workflows, it's just pain. And then another one, tools for debugging and automation. So I'll give you an example of a couple of things. First, for automation: automatic feature selection. The fact that you have a tool that actually scans every feature you could possibly use, and then while you're sleeping this is making sure that you have maximum machine utilization, and you're just doing whatever feature selection algorithm you want. I don't care. It doesn't matter. But it's just doing work for you. True story is, ads engineers would come in on Monday morning and they would see proposals for new models. And you're like, ""Oh, this looks good."" I get a couple of 0.1-whatever percentage points of gain in whatever metric, and that's good. The other one: a very simple reason ML systems fail is because some data pipeline fails. And again, if you have to be checking ad hoc, it's a pain. Imagine that you have these beautiful dashboards with colors and whatever that just tell you what features are not working, and in which way they're not working anymore. Is it that you have, like, statistical things, where they still produce valid values but they're the same all the time? Or is it that you get things that are not a number? What the hell is going on? That's super useful as well. Or tools to look at your learning curves and whatever. So these are a bunch of examples of things which, if you're an ML engineer, you want that stuff. That totally makes sense. I want to make sure we leave plenty of time for the other thread of questions.
Which is the new work you're doing as... I think it says on LinkedIn you're the tech lead for Responsible AI. Which sounds like a tall order. I mean, there's so many possible questions here. I was kind of wondering what would be the most interesting. But I think that... I guess the genuine question that's top of mind for me is always walk me through a real decision. Where it wasn't obvious what to do. And by some kind of analysis or thinking about it, bringing your expertise. You were able to kind of guide Facebook to a better decision. Does something come to mind? I'm going to start from the India elections. This was the biggest election in human history with almost a billion eligible voters. So what's the challenge and where does AI come in? Well challenge is that there's a lot of concerns of election interference. Through the spread of information, which is either false or misleading. Or voter suppression or whatever it might be. And of course the way you address this, if you're Facebook or a similar company, is you create guidelines. For what things are acceptable and what not. There's of course, legal constraints as well. And then you just hire a bunch of humans. As many as you can. And you would know about that because you've worked on that in the past. So you have humans who are actually processing a queue of work. And that queue of work is just reviewing posts. But when you have a country the size of India and the volumes of information. Or content that are created every day on Facebook. It's just impossible. You cannot hire enough humans to review even a decent fraction of everything. So it's impossible. So the way you use AI is use AI to prioritize human work. And the way you do this, is for example you train a type of classifiers that we have used. We call them civic classifiers. And what they do is it try to tell whether the piece of content is just a picture of a cat. Which is like, ""Whatever, it does it matter."" Or people like me I'm a runner. So did I post a new run on strata? It's like, ""Whatever, he doesn't matter for the elections."" Or whether it's actually someone discussing something that's actually relevant. Social or political or civic issues. And then at least make sure that that type of content gets coverage. So what's the challenge? We're talking about resource allocation. We're talking about, you have these set of humans that we're paying to protect the elections from interference. And now the question is... And we're using AI to prioritize our work. Well, what happens if your NLP works only for Hindi? Wait, sorry. Could you even back up a second? Because this is probably obvious to you. But it's not totally obvious to me. Assuming you had unlimited human resources to do something, what is the thing you're trying to do? I mean, obviously you're not trying to block everything that's on the topic of an election. Yeah. Apologies. What's the goal? I should have explained that. You will block things if they violate the laws. So if you have... I don't know. Defamation of public figures with lies or just like illegal content. Or reduce the distribution of things that are harmful. So it's both like filtering and reducing distribution of things that violate our community standards or laws. So that's the action that you're taking with a commission of humans and AI. And so the challenge there again, is if you look at this from a fairness point of view. 
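A toy sketch of the prioritization just described, where a civic-content classifier's score decides which posts a limited pool of human reviewers sees first; the posts, the keyword-based scorer, and the queue size below are all made up for illustration.

```python
# Toy sketch (hypothetical data and scores): use a civic-content classifier's
# probability to decide which posts a limited pool of human reviewers sees first.
posts = [
    {'id': 1, 'text': 'picture of my cat'},
    {'id': 2, 'text': 'claims about when election day is'},
    {'id': 3, 'text': 'my morning run'},
    {'id': 4, 'text': 'post about a local candidate'},
]

def civic_probability(post):
    # Stand-in for a trained classifier; returns P(post discusses civic or political issues).
    keywords = ('election', 'candidate', 'vote')
    return 0.9 if any(k in post['text'] for k in keywords) else 0.05

review_capacity = 2  # far fewer reviewers than posts
ranked = sorted(posts, key=civic_probability, reverse=True)
human_review_queue = ranked[:review_capacity]
print([p['id'] for p in human_review_queue])  # the posts reviewers look at first
```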
Maybe your definition of fairness is that if we're investing a certain amount of human resources to do this job, that we want to make sure that everyone in India gets protection. From this type of harmful content. And then the question there becomes what does that mean? Because if you think about algorithmic fairness and bias. If you're thinking about using AI to recommend jobs to people. Then you get you into and you're in the US. You think about protected categories, you think about gender, age, race or ethnicity and stuff like that. Where there's anti-discrimination laws that exist. But if you think about this from India you're like, ""Oh, politically, what are the hot areas?"" And then immediately when you work with local people, it's things like caste or religion. But obviously we don't have that data. And it's not clear that... It's not good that we should have that data. So in the end you do a bunch of work and you figure out what can I do? And so we ended up using language and region. Sorry. Well, again if you had caste and ethnicity... I'm sorry. I feel like I'm showing my ignorance here. But if you had those things, what would you do with that? What would be the fair thing to do that you're trying to do? So the challenge with fairness. And that's where we're going to go back all the way to music somehow. Is that there isn't one definition of fairness. If you look at philosophy, whether it's moral or political philosophy. Or you look at the law. Or even you look at the vibrant community in the computer science community and machine learning. Who's thinking about algorithmic bias. One common pattern is that you have multiple definitions of fairness that are mutually incompatible. So you have to pick one. In this case the one that you could pick, is you could say, ""Well, I want to make sure that everyone irrespective of their caste or religion, is going to see content that has received a comparable amount of protection. Against harmful or content that is basically misleading."" Imagine there's voter suppression type of content they read that spreads lies about... I mean, there's even stuff like just lying about when the election day is or whatever. I see. Actually thanks. That's helpful. Then you kind of miss it. Or maybe lying about what a particular politician stands for. Just sort of putting out something that's completely false. So you want to... Go ahead. So I guess one way... Just to repeat back what you're saying. One way would be we want to make sure everyone across groups like caste or religion gets the same level of protection? Correct. By actual humans looking at the content? That's exactly right. Why might that not be the most fair approach? Would there be a different argument for different? Yeah. You have situations where... So here this would be an equal treatment type of argument. Where you would say, ""We want to treat everybody the same."" And equal treatment is I think in many cultures like civil. The first instinct that you have. But you could think about other things. On the one hand you could dial things more towards equity. And inequity you could look at historical disadvantages that some groups might have had. Is there a case where historically some caste and religions are privileged compared to others. And the pressure or the amount of misinformation. If you think about the US, not every group in the US has historically had equal access to voting. And even today, voter suppression efforts are not uniformly distributed. 
Some groups are actually more targeted than others. So you could actually say, ""I'm actually going to understand whether I should prioritize outcomes for some groups over others."" And if you think about... There's many as many sort of public policies in society that actually sort of aim at focusing more on some groups that have been disadvantaged. I see. And so what did you do? So in this case we went for the equal treatment approach. And then what we did, this triggered a whole amount of work. First of all, we don't have caste and religion. And there's many reasons, there's many risks why a corporation shouldn't have certain type of demographic information. There's a lot of examples in history why it's just dangerous to have repositories of certain demographic characteristics. So what we did is we used reasonable alternatives like language and region. And so we said, ""We're going to make sure that all regions in India..."" And not all languages because it's a huge amount of languages in India. But I think we went for the top... I don't remember any more. Top 15 plus, minus languages were protected. And then you can get into things. How do I translate that into math and code? So you need to look at many levels. One, you need to look at the most basic thing is when you look the data, you look at two things. You look at representation. And then you look at biases in the labels. So representation, make sure that across you build yourself your matrix of regions and languages. And make sure that for each of these buckets, you have a sufficient amount of labeled training data. And then once you're in one of these buckets, you get yourself some ground truth data. And that would be a very long conversation to figure out what that is. But expensive, high quality data that you can use as a reference. And then you kind of measure. You look at the difference in errors that you have in your labeling process across all these buckets. And you want to make sure that you don't have systematic differences. But of course that's not enough. Then you actually look at your models themselves. So you turn your model and you look at things like, ""Oh, in the prediction errors do I have systematic differences."" And one cool thing to look at if you have binary classifiers and you're using... Here you would be using the probability that something is civic content. To prioritize a review. In that context, it's very reasonable to use actually a calibration. To look at the whole calibration curve. And make sure that my calibration curve which maps scores to actual outcome rates. Make sure that those curves look similar for different groups. That I'm not over predicting for one group and under predicting for another. Because if I were over predicting for a particular language, then I would be allocating more human resources to that language. And if I'm under predicting for another, I'm allocating fewer resources, but it's not justified. Because that doesn't reflect the actual volume of content that actually needs to be reviewed for both. But is it... I guess, is it possible that some language has more banned content? And then how can you be sure that... It seems like your model would sort of naturally use that as a feature in the model. And then it would sort of naturally get over index. How do you back that out? I think that's how you... That's where you use calibration. So if you think about a calibration curve, you're looking at how your scatterplot of... 
You group your scores so the thing is breaking your score at zero to one. Which we interpret as a probability of something being it needs to review. It's true that the distribution of scores is going to be different between languages. Or if one language is being more under attack, then you're going to see more stuff with a higher probability. But what you really want is once you bucket things by score, you want to kind of look within those buckets. What's the actual percentage of content that was violating in that bucket? And you want to make sure that 0.6 means roughly 60% for any language. And then the pieces of content that fall within that bucket is going to be different between languages. But that's not a problem. I mean, and eventually as a result of the distributions being different of scores, you'll end up investing more or less resources in a language or another. But at least you have apples to apples in terms of your risk scores. I see. So you let the model use the language. But then you back it out in sort of a post analysis based on the actual performance. Am I explaining it right? I mean, if you're using NLP and you're building different classifiers for different languages. Then inevitably you're using the language in your NLP model. I mean, having said this, of course we have a cross lingual embeddings and all these fancy things obviously. But you'll still need some sort of training data. The question of whether you should use an input signal or not, is a long and fascinating discussion as well. And I think it is in my view, somewhat orthogonal to many of the ways. In which you would make sure that you have procedural fairness in your classifier. So we need another couple of hours to discuss that because that's actually a very active topic. One of the papers that explains it well as Cynthia Dworks and coworkers. A paper called fairness through awareness. I'm probably butchering the title. There's more to it, but this is the bed. Where if you're trying to be fair across genders, when you're recommending job offers. Should you actually not use gender as an input to your algorithm? Or should you use it? And there's examples that illustrate both positions. So I don't think it's as easy as to say, ""Oh, if I don't use gender as an input to my algorithm, then I know I'm going to be fine."" And the reason is simple, Is that A, you have a lot of features that correlate with gender anyway. But then also if you think about it from a causal perspective, you're going to have certain things you can measure. Which have opposing effects depending on whether you're a male or female. For one, females carry babies and get gaps in their CVs. And so is the effect of a gap in your CV the same depending on your circumstances? That's not clear. Causality actually is probably one of the most exciting lenses on fairness in many ways. But its super early days. Interesting, I guess to ask you another question, that's probably another long question. And this is one of the ones I always worry about with the fairness and AI stuff. I guess, how do you engage with the people who are actually affected by these decisions? It always makes me a little nervous. This idea that scientists go in and sort of get to decide what's fair. And I can see... I can kind of see why. It's important that someone kind of understands the algorithms. That's one point of view. But I mean how did you engage with the folks in India who are affected by this? To even decide what's the fair thing to do? 
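A minimal sketch of the per-language calibration check described above: bucket the scores, and within each bucket compare the average predicted probability to the actual rate of violating content, separately per language; the scores and labels here are synthetic.

```python
# Minimal sketch of a per-group calibration check: within each score bucket,
# a score of roughly 0.6 should correspond to roughly 60% violating content
# for every language. Synthetic data, arbitrary bucket edges.
import numpy as np

rng = np.random.default_rng(0)

def calibration_curve(scores, labels, n_buckets=10):
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & (scores < hi)
        if mask.sum() > 0:
            curve.append((scores[mask].mean(), labels[mask].mean()))
    return curve

# Pretend scores and ground-truth labels for two languages.
groups = {}
for language in ['language_a', 'language_b']:
    scores = rng.uniform(size=2000)
    labels = rng.uniform(size=2000) < scores  # a perfectly calibrated toy model
    groups[language] = calibration_curve(scores, labels)

# Large gaps between predicted and observed rates in the same bucket for one
# language but not another would mean reviewer time is being over- or
# under-allocated to that language.
for language, curve in groups.items():
    worst_gap = max(abs(pred - obs) for pred, obs in curve)
    print(language, 'worst calibration gap:', round(worst_gap, 3))
```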
It's essential for AI practitioners to understand that responsible AI is not primarily an AI problem. It's as simple as that. And you pointed to the question of governance. Who should decide? It's not the AI practitioner. It's not me for sure. So how do you... What does that mean in practice? In practice what it means is you build something like a fairness maturity framework. We're building one like this. You work with ethicists, with lawyers on building it. You try to capture the different interpretations of fairness that exists. And what this ends up being is not a tool that tells you what to do. It's a tool that gives you a big menu of questions that you should ask and consider. And then what you build is you build processes of consultation. Where you sort of put the options on the table and then you have a decision framework. Where you sort of weigh pros and cons and risks. And this has been used way before AI. These kinds of risk assessments and decision processes, consultation processes and so on. And one example of this, I think that is quite interesting is Facebook has to build this external advisory board. It's not fully rolled out yet. But it's 40 people if I remember correctly. Who represent all kinds of countries and political views and other types of views. And their goal is going to be in the context of content moderation to kind of look at all these edge cases that are hard. And then come up with recommendations. Obviously they're going to carry a very heavy burden of representing lots of people. But they don't work for Facebook. They're an external body. And I think that one of the... If you want ideas for what to do next after weights and biases, I feel like... Although I'm sure you're going to be busy with this for a while. I think the question of governance in AI... And how to build infrastructure. And this is people in infrastructure for transparency, for accountability, for risk assessments. You see the recent EU paper on AI scratches the surface. By asking some of the big questions that need to be answered for responsibility. I think we're only getting started here. But the thing that I'm most excited about is that AI is going to replace humans in decision making. Across the range of decisions that people make in any domain. And I think most of the time it's going to be a huge improvement. But now all of the sudden we need to go through thousands of years od political science. Or how do societies govern themselves and kind of bring that into AI. So that's a pretty freaking daunting task, but I think this is what we're talking about. And every investment that I see in this is orders of magnitude smaller than it needs to be. When we were last talking, we were talking about an actual kind of case study. I thought that was really interesting. On voting in India and stopping the spread of misinformation. And how there isn't kind of one definition of fairness. And you kind of give people a menu of options, which I think is a really interesting perspective. I guess I'm kind of wondering if you could say a little more about what might be on that menu of fairness? I think it's so interesting when different people have different ideas of what's fair. And actually you say it's not your role to resolve it, but you must have opinions on what feels fair and not. Yeah. Of course. That makes a lot of sense. I think the most important thing to realize first of all is that fairness is a bit of a social construct in a way. 
It depends a lot on context and it depends a lot on how a particular society has decided to govern itself. So fairness ends up being political inevitably. So let's start to ground this with a very concrete example. So here's three possible interpretations of fairness that resonate both with moral philosophy interpretations. But also with legal interpretations and finally with mathematical interpretations. Because the computer science community is also building metrics of algorithmic bias. All right, so here's the three. The first one could be minimum quality of service. This is also known as minimum threshold and philosophy. And the idea there is that you want an AI for example, to work well enough for everybody. And well enough might mean, if you have a computer vision system that detects people. Or detects faces to be able to put masks on them or whatever. That it works well enough across things like skin tones and skin reflectance, and age, and gender, and other characteristics. That will be sort of a concrete example. It doesn't matter if it works a lot better for a group than another. As long as it works above a certain precision recall for everybody. The second interpretation would be equality. So if we go back to the India misinformation example. One question there could be, if I have some measure of accuracy for my... I think we were talking about the civic classifier that basically identifies among all of posts about cats and dogs on Facebook. What are the ones that are actually discussing political issues? Maybe the political agenda of a particular politician or party. And again to recap, we want to find those because we have limited resources in terms of human reviewers. To look at content and check if they violate our policies or the law. So you want the AI to basically prioritize those cues essentially. This is something Lukas that I know you understand very well. Because you've worked on an onset of human computation a lot. So back to equality in India. Obviously languages and regions have a big social significance. Because they align with religion, they're aligned with caste. And they're aligned with other important sort of social dimensions. So imagine that your civic classifier works well enough for everybody. For all languages and regions. But imagine that it's under predicting a little bit for some language. And over predicting a little bit for another language. So what would happen is that we would be allocating more human resources for the language where it other predicts. And a bit too few for the one where it under predicts. So there, we actually want to have a higher standard in a way of fairness. We're going to say, ""Look, minimum quality of services is maybe not good enough. We want to make sure that we're offering equal protection against misinformation to everybody as much as possible."" And then the third interpretation of fairness, which is sort of widely accepted. Would be to go from equality to equity. And so when we think about equity, we no longer think about equal treatment. We think about, ""Is there any group that deserves special consideration?"" So we're living this in the US right now. Obviously with a big awareness and awakening around racial justice. Where we're obviously paying special attention to the black community in the US. And the reason we're doing that is because of historical structural disadvantages. So if you took this to India, there might be a legitimate question. 
Some people might ask, ""Hey, actually maybe historically there's been some groups, some regions, some languages in India that have suffered more from manipulation or injustice."" Therefore we actually are going to allocate extra resources to make sure that, that group is really protected. Because given the same amount of disinformation or misinformation, the harm to that group will be bigger relatively speaking. And so these are questions that an AI engineer like me should be asking, but not answering. It's really important to basically escalate those questions to the local team, to policy experts. Find ways to involve external people to give an opinion. So that's what I mean with a menu of options. Each of those translates in math and in code to a different choice. But that choice I should not make neither deliberately nor accidentally. By just picking something that looks reasonable mathematically if I don't understand what the implications are. Do you find that it's easy to articulate those to a nontechnical audience? I feel like you're framing it in a very technical way. Is it clear to people what they're choosing? Help me understand your question. So who would be the audience more concretely? Well, I guess, I'm imagining you're saying, ""Do we want to kind of treat all regions equally?"" In the India example or something else. And then that something else might be we prefer to over predict some regions. And I'm trying to picture... I guess that part makes sense, but it seems like actually, there's sort of this other question. If we wanted to sort of do something that I think is kind of affirmative action. In college admissions I'm picturing. So if you want to do that, actually then you have to get someone to tell you kind of exactly the tuning that they want, right? And I'm not sure I could even come up with what's exactly the fair amount of... The fair distribution to apply. I'm not even sure how I would answer that. Or I'm not even sure how I would ask someone that question in a way that I would get a useful answer out of them. We certainly don't walk around in our heads with exactly a particular distribution that feels the most fair. 100%. I think that's exactly the reason why equity is the hardest of these three lenses on fairness. So I think in practice you'll find that most teams, most product teams, most AI engineers will be either asking questions of minimum quality of service. And if you want we can talk about how to operationalize that. It's surprisingly easy actually. Or questions of equality. Of equal treatment which is conceptually easy but a bit harder to implement. When it comes to questions of equity, these are not really questions that are directly addressed to AI engineers. These are really questions that the overarching leader of a product needs to be a reasoning about equity. I'll give you a concrete example. Adam Mosseri who leads the Instagram team has started to make public posts that you can Google about Instagram and equity. And basically what he's starting to do, he's initiating a dialogue. Where he's saying, ""Hey, we will put the interests of communities above the interests of Instagram."" If we feel that a certain product causes unintended harm to a community. Or that it doesn't serve it as well as we intended, then we will actually stop and rethink it. What does that translate exactly? If I'm running ads and I feel like, ""Oh, ads isn't working for everybody."" Does that mean I shut it down? Do I have a percent? 
Do I say, ""Oh, I cut my losses at minus 10%?"" We don't need to stay within... We don't need to stay within the Facebook, Inc sphere. I have close friends at Spotify and at Netflix. The same questions occur there as well. And like, ""Hey do we inject some diversity of content?"" Do we allow some producers, some musicians, some filmmakers that are maybe a little bit in the shade? To kind of pop up. And then what's the hit that we're taking in terms of our engagement metrics. In terms of how many songs people listen per day? Or how many movies or shows people watch per day? And stuff like that. I don't think there's an exact science on that at all. But it's a very real question that many people are sort of reasoning about. And the last thing I'll say is that one of the big challenges is a question of governance. And I think you were alluding to that. It's a question of who decides. And if you think about it, we have democratic processes. I live in Mountain View. The city of Mountain View decides where we put bicycle lanes. And of course they're going to slow down traffic, but they're going to create all their benefits. They're going to decide on urban density. On things that are all trade offs. There's obviously... In luxury resorts and stuff like that like in Truckee. I know because we recently bought a house there. And the city council will demand that you reserve a certain part of land and building. For sort of less expensive dwellings to sort of give access to housing to everyone. And in those cases, it's a bit easier. Because there's a democratic process by which that city council gets elected. There's public consultations. If I think about one of the challenges that we're facing as technology companies is this idea of how do we bring in public deliberation? And consultation mechanisms into decisions we make.",8842 +Richard Socher — The Challenges of Making ML Work in the Real World,https://www.youtube.com/watch?v=xa0zQMFS9Tk,3054,2020-10-01,"I love the fact that when you develop really novel A.I. techniques that you can apply it to so many different areas. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Richard Socher was the chief scientist at Salesforce, where he oversaw development initiatives like Einstein's Cloud A.I. as a Ph.D. student at Stanford University. Richard helped create the ImageNet data set and also founded the Start-Up MetaMind. Recently, Richard left Salesforce to create his own startup, SuSea, which is on a mission to build better Internet using the latest NLP and AI technology. I'm excited to talk to him about all these things today. I was curious how you got inspired by this AI Economist paper and project. I mean, I was trying to read it and I'm not an economist, so I had a whole bunch of basic questions that are probably pretty embarrassing but when did you learn so much about economics? And it's such an interesting idea. Maybe you should probably summarize the paper first for those who haven't read it and then talk about how you got interested in it. Happy to. Yeah. So, AI Economist is essentially framework, it's more than just a single model or something. It's a whole framework that tries to model an economy and in sort of the most simple forms for now. Though it'll get more realistic, I think in the next months and years to come. 
And then inside that economic simulation, you have a 2-level reinforcement learning setup where you have an AI economist that basically can set taxes and subsidies and other kinds of financial instruments in order to optimize an overall objective for the economy, namely, in our case, productivity times equality, where equality is measured as sort of a 1-Gini index, which is a measure of equality that's used worldwide. And productivity makes sense in terms of how much do all the single agents in the simulation make? Each single agent is also a reinforcement learning agent, but their goals are just to maximize their own objectives, which is to maximize their own income and wealth. And is that realistic? Of course not. Also in the simulation, there's mostly just three different types of resources, wood, stone and space in some ways. Then agents walk around in this 2D grid world. They can build houses, they can block other agents by building these houses, and they can trade resources as well to try... You need certain more wood to build the house, but you have plenty of stone, you can trade it and so on. So it's some of the fundamentals. Also, you have utility curves, which is quite common in economic modeling that you wouldn't have in the game. What does the utility curve do? It tells you, for instance, that after a certain amount of work you have diminishing returns. You could work seven days a week. But most people at some point want to actually take some time off and not spend all their time just to minimize another little bit of money. That and a couple of other things make it quite different to playing just a game. We thought about this, too. Could we just use civilization or age of empires or some of these things? But we wanted to, one, steer away from this zero sum war games where you train and just get really, really good at fighting each other and instead try to think of a system to try to have an overall improvement for the world so that if that system actually gets deployed, it would have as is a positive impact versus, oh, we used it to develop interesting technology that eventually maybe will have this positive impact. So that's kind of what AI Economist is. What's interesting technically and hard about it, is that with this 2-level reinforcement learning, essentially the AI Economist keeps changing the goalposts for all the RL agents and they say, oh, I found this great strategy. I'm going to sell this, trade this, collect these resources and build houses in this way to block off some other person. And then all of a sudden the AI economist changes because they realize you have a monopoly and equality is suffering and maybe you're going to tax the person with the monopoly and I have blocked all the resources away from the other agents. And now all of a sudden, the agents have to adapt too and almost all RL before, you have a fixed objective function. You know, this is how you win gold or is how you win chess, this is how you win lottery games and so on. They don't change, and here the goalposts keep changing so it is a really hard, interesting optimization problem. So that's kind of what the AI Economist is. Now how did we get to that idea? It actually came from a couple of different strains. The first time I had this idea was during my PhD where this idea of essentially all these different cultures in the world have had their different energy landscapes on their optimization strategy and a lot of them were trying to optimize roughly similar things, you would hope. 
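A small sketch of the social objective described above, productivity times equality with equality taken as one minus the Gini index; the incomes below are made up and the paper's exact normalization may differ.

```python
# Sketch of the productivity-times-equality objective described above:
# equality = 1 - Gini index of agent incomes, productivity = total income.
# Toy incomes; the exact formulation in the paper may differ.
import numpy as np

def gini(incomes):
    # Mean absolute difference between all pairs, divided by twice the mean.
    incomes = np.asarray(incomes, dtype=float)
    diffs = np.abs(incomes[:, None] - incomes[None, :])
    return diffs.mean() / (2 * incomes.mean())

def social_objective(incomes):
    productivity = np.sum(incomes)
    equality = 1.0 - gini(incomes)
    return productivity * equality

print(social_objective([10, 10, 10, 10]))  # equal incomes: equality = 1, objective = 40
print(social_objective([40, 0, 0, 0]))     # same productivity, low equality, objective = 10
```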
People want to prosper. People want to have certain amenities and freedoms and so on, but they all end up in their different local optima. And I thought about this as this non convex objective function that different cultures go in and try to optimize it and end up in all these different local optima. And so that was kind of the idea but I didn't quite know how to structure it as an AI problem. I just had some sort of quick little notes and I drew some objective function and I kind of just started and continued to do NLP research. And then we hired Stephan Zheng, the first author of the AI Economist and we've also had Alex Trott in the team already for a year and with him, we were working on trying to build houses just from lots of bricks, reflect resources we have a 3D agent, something like Minecraft, that tries to build a house. And then house building turned out to be pretty complicated but the goal of that house building project was to eventually have multiple agents and a whole island and the whole thing. Then we realized, man, we could spend another two years or three years just trying to build the houses properly before we could get to that AI Economist and thought, Stephan, like, hey, why don't we just go directly, assume the house building is just one action. Build the house in this location and that's it, rather than all the different 3D structures and so on and finding out what is a good structure for a house, just one thing and then eventually maybe we can merge into projects but then Stephan actually did a phenomenal job deepening our understanding of the literature and economics and reached out to other economists and that's how we worked with David Parkes in the end, from Harvard, and really fleshed out how to make it work in an RL framework and then getting all the complex optimization going and so on. So I have to give a lot of credit to Stephan. And Alex eventually said, you know, screw this simple house building. This is why I'm wanting to do this anyway. And he's been interested in economics for a long time, too. So the two of them kind of just jumped in on that project and it became really great. That's so cool. I thought there could be a lot of debate on the objective function, right? You did the economic growth times one minus the Gini coefficient. I mean, why not economic growth plus the one minus Gini you know? I can think of many other ways to do it was that... And so could we, and I think eventually you're going to... Like it. I love this project in that it literally covers the whole spectrum from like a hard core optimization problem that's really technical and sort of min-max and shifting objective function and it's landscapes and so on, all the way to the most philosophical civilization level, debates and questions of what we need to do in the world and what is economics and politics and what are we all optimizing and should it be quality? Like in some ways you don't want the Gini index to go all the way to its maximum either, because that means absolutely everybody is forced to be 100% equal, which is questionable in terms of monetary things. Of course, we should all be equal in terms of rights and opportunities and so on. But in terms of financial equality, it's an important thing to point it out. Yeah. In fact, I think actually this kind of work will help with that kind of equality because we can push for that and improve it a lot. Does it mean we have to get to the maximum rate? 
Maximum would be like infinite productivity and nobody has any difference in the world and I think we should also celebrate some types of differences. But I think economic inequality is, in my eyes, the single biggest issue that we have in the world right now. A lot of issues fall from that. If certain minorities would have more economic equality, they would be better off. I think we'd have less racism, less sexism if people of color and women had the same financial equality as men do, statistically speaking. I think economic equality is a big part, like a lot of wars get started from that. A lot of genocides and so many other issues happen from economic inequality that it's a really tough one. Now, should it only be productivity times equality? Maybe not. Maybe there should be other things like sustainability in there. So in the simulation, you have trees. The trees will eventually regrow, but you can have a tragedy of the common situation where all the agents just get rid of all the trees and then there are no more trees and they all optimize their thing. Everybody's equal. But then, long term, everybody will suffer because there will be spaces and people won't feel anything anymore and it will flatline because they destroyed all their resources. So I think sustainability is a reasonable one, and then there are interesting questions, clearly, utilitarianism isn't I think, at least philosophically, the only answer to this. So you may need to have other protections in the objective functions and some boundary conditions. We could talk about just that for hours, probably, and over drinks in the evening. We could really spend a lot of fun philosophy and ethics of what what we should optimize. I think what I'm hoping to realistically, though, is that when a politician in the future would say, I want to help the middle class, that's one of the things I want to do. And then eventually, either right away during their campaign or later on, they propose to say, now, this is what I'm going to do to do this. And then you run that set up through the simulation and you say that is really different from any of the potential solutions that the simulation would get for helping the middle class. Why aren't you... Why does yours differ so far? What's your thinking about how that will actually help more than these other ways? And so hopefully we can agree more easily on the objective function and then we can disagree less on how to exercise and how to get there. In your simulation, was there any emergent behavior that surprised you? Is there anything counter-intuitive that you discovered from doing these experiments? Yeah, there's definitely some things that at first you're like, wait, this doesn't make sense. So we have taxes and we have subsidies. And when you look at it, it's actually the lowest income bracket got taxed a lot and then it actually went down for this sort of middle class of the simulation. And we're like, wait a minute, that seems very counter-intuitive. But it turned out they were also getting more subsidies. So they were actually much more positive because the subsidies were also given to that income bracket. But that at first was kind of counter-intuitive but then once you double clicked on and you realize, well, effectively they're actually getting more subsidies than they have to pay taxes. So it kind of levelled out and made more sense. Interesting and sorry, this is I guess maybe a personal question, but you grew up in eastern Germany, didn't you? I did for a couple of years. 
Ethiopia for three years and four years or so of East Germany and then Germany Reunited. I see. Do you feel like that gives you a different perspective maybe than Americans on these topics? I mean, I think I was still pretty young when the Berlin Walls came down. So I think, though, culturally, of course, East Germany still had... It wasn't like Germany Reunited and then there was like no more differences between East and West. In fact, some of the issues you see in other countries like you would see between east and west, like the east, still has lower income compared to the West, and a lot of women actually left East Germany to go to the west. So there are a couple of counties that have like too many men and so on. So there are still a lot of differences between the east and west still to this day. But I think growing up in Germany overall, which is where I got most of my education was in Reunited Germany, probably did affect me. Like in general as I grew up, it was free health care and free education all the way down to or up to a PhD level, Masters level, was just not ever a politicized question. It was just a given and being sort of anti big military intervention was something that was still pretty deeply ingrained and Germans have been in there twice for a century. It was clear that that's something we should all try to as best as we can to avoid. So it was just a lot less sort of pro military conversations going on in Germany. And so, yeah. So it's kind of interesting. The whole political discourse in Germany is kind of shifted, even the most sort of liberal, pro economy, pro companies, types of parties and on the political spectrum, none of them would ever question free health care or free education, because statistically speaking, it just helps everyone. It's kind of an interesting definition of freedom even now. Is it more free that you always have health care no matter what job you have, or is it less free because you have to pay for it? Like it's interesting, interesting cultural differences. So that definitely had a little bit of impact. Got it. Yeah, that makes sense. I mean, I guess I also want to make sure we cover some of the other papers that... Since I have you, I have just so many questions. I was wondering if you could talk about.. We've actually been you know.. My company has been working with a lot of people, doing various aspects of protein generation and folding. And I really feel like there's something going on in ML right now with all the applications. And it's something I know very little about because it didn't feel like a topic when I was in school. I'd love to just... If you could just describe I mean.. It's such an intriguing idea that language modeling techniques could be used for Protein Generation. Maybe you just tell us what you did and kind of what you think about the field in general. Sure. Yeah. So generally, high level Protein Generation or the ProGen model that we published is a language model. As a language model, it's basically just trying to predict the next word in a large unsupervised text. So take all of Wikipedia, as much of the Internet text as you can, and some people innovated by taking Reddit data, which is kind of more interesting, but then also has issues with bias and so on; and you have a very simple objective function for a large neural network architecture just to try to predict what the next word is. 
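A minimal sketch of that next-word objective in PyTorch: shift the tokens by one position and minimize cross-entropy on the prediction of each following token; the vocabulary, model size, and data here are toy stand-ins rather than the systems being discussed.

```python
# Minimal sketch of the language-modeling objective just described: predict
# the next token at every position and minimize cross-entropy. Toy vocabulary
# and a tiny model; real systems use much larger transformer architectures.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 100, 32, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)  # logits for the next token at every position

model = TinyLM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, 21))   # a batch of toy sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position
optimizer.zero_grad()
logits = model(inputs)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
optimizer.step()
print('training loss:', loss.item())
```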
And people have been doing that for many, many years, many decades, because it's a very helpful way to disambiguate words that sound the same, but actually a written and mean something different. So if I want to say the price of wood versus would you tell me something? Then the wood sounds the same in both sentences, but in one it's the wood of trees and one's an auxiliary verb. And so you basically go into disambiguate which one is more likely in that context and so that was used for speech recognition and still is in a lot of speech recognition models for translation you know, you can try to generate a bunch of possible translations for German to English translation, but then try to identify which sentences are the most fluid, fluent for the English language. And so the interesting novelty that came out recently with GPT and Open AI is to actually take these existing models to make them even bigger and not just look at the perplexity numbers going lower and lower, perplexity is essentially sort of an inverse metric of the probability that you assign to each word. So the less perplexed you are, the more correctly you've assigned probability mass to the word that actually comes next. And so as the perplexity reduced more and more, you cross the threshold and Open AI was clever enough to realize the threshold is so low now, we should really look at what they're generating and see what comes out and it turned out that they're actually surprisingly good, better than most anybody in the field had thought five, ten years ago in generating fluent paragraphs that actually made sense, that had some coherence and flow to them and of course, after one or two paragraphs, they will repeat themselves and won't make that much sense still, because they don't have this, I think is actually an interesting question for the future, what's the next objective function? Like just producing the next word and generating the next word isn't going...that doesn't include the fact that usually when you try to say stuff, you have a goal in mind to convince somebody of something to learn something, to get a message across, to get somebody to do something. All these different goals that you have as you try to use language; that I think will be the next level of AI research to identify and understand new objective functions in general and that actually allow AI to come up with its own objective function but that's.. Anyway, so back to ProGen we have.. This is fun like usually I don't have that technical of an audience to geek out and about these things. It's just that I have to stay more high level for most other interviews. And so what's cool about ProGen is we took this idea of predicting the next word, which for languages makes sense. Humans can do it, but humans can't actually do it for proteins. We're not built to look at a bunch of different amino acids and protein sequences and then try to learn what they would look like, what would come next and so I love the fact that when you develop really novel A.I. techniques, that you can apply to so many different areas and I still think that one of the most exciting things is if you find a new model family and then you apply it to all these different things and you show and eventually have a multitasks model that can do multiple different things. So here it made sense to us because, again, it's a language that has a meaning, we just have much harder way of accessing that meaning and we have a ton of training data now that sequencing is getting cheaper and cheaper. 
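The perplexity measure mentioned above can be written in a few lines: it is the exponential of the average negative log-probability the model assigned to the tokens that actually came next; the probabilities below are invented to show how lower surprise means lower perplexity.

```python
# Sketch of perplexity as described above: the exponential of the average
# negative log-probability assigned to the tokens that actually come next.
# Lower perplexity means the model is less surprised by the text.
import math

# Probability the model assigned to each true next token in some toy text.
token_probs_good_model = [0.4, 0.5, 0.3, 0.6, 0.45]
token_probs_bad_model = [0.05, 0.02, 0.1, 0.04, 0.03]

def perplexity(probs):
    avg_neg_log_prob = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_neg_log_prob)

print('good model perplexity:', round(perplexity(token_probs_good_model), 2))
print('bad model perplexity:', round(perplexity(token_probs_bad_model), 2))
```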
There's also an interesting sort of learning about you... The first time somebody got sequenced was incredibly expensive and that was like a white man and now like you can for one hundred bucks, anybody can sequence their technologies, so that's actually a great, great story for technology. So long story short, we predict each new protein one at a time, and then also generate new proteins. And so what does that mean? Why would that be useful for people not familiar with proteins? Everything in life is governed by proteins. Every function in your body is governed by proteins. Deep down and the level below cells and everything, it's all guided by proteins. The digestive system, even you could develop proteins that will fight certain types of cancer, certain types of viruses that's actually something we're also working on now to try to do some interesting things for curing certain kinds of viruses but it's too early to talk about it right now. It will take some time. It's kind of another moonshot. But there's really exciting work that you could do to develop proteins that will need plastic to try to help with pollution. It is unlimited the kinds of stuff you could do with proteins if you understood that language well. One big important factor for this protein model was that it's also a controllable language model, that it has these control codes in the beginning because you don't want it to just sort of randomly generate random proteins. You have an actual goal in mind like this should bind to this binding site in a cell or this should try to be able to connect to a plastic or all these different kinds of things you could do. We have these control codes. They basically give you the function and which area of the body and things like that it's in or what other binding sites you should have, and then it will actually generate reasonable proteins. And Ali, on our team, has just been doing a phenomenal job pushing that line of research. He's also the first author of ProGen. Of course. How did you get the training data for that, like, I could see how you could get protein data from DNA but how did you get the data of what these proteins do and things like that? So we took a subset of data and that had some kind of metadata associated with them. What's interesting is there you actually can look at a lot more tree data once you just say, look, any control code goes, it just goes in here and then we can also use that. The majority of datasets are still very unstructured. There's no good documentation and coherence between these different datasets. Each different dataset in on the proteins are of different lengths and then some people say, oh, it has these three functions and other people say, well, I just got this from somewhere. The next level is actually to try to train it, even if you have zero metadata associated with them. There are some interesting meta studies that have a lot of unsupervised sequences from soil and all kinds of things, and so if you could learn from unsupervised sequences, you could train them even more but for now, we just took datasets that had at least some kind of metadata associated with it. Even if there is no general nice Imagenet-like taxonomy or Wordnet-like taxonomy for them but any kind of metadata was enough for us to incorporate the data into them. And was this the same with GPT where it's just like predicting the next one and you're just trying to have the lowest perplexity or the highest probability? That's right. 
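A rough illustration of the control-code idea as applied to proteins: whatever metadata is available is mapped to special tokens and prepended to the amino-acid sequence, so the same next-token objective learns to condition on it; the tags and sequence here are invented and do not come from the actual ProGen vocabulary.

```python
# Rough illustration of conditioning on control codes, as described above:
# available metadata tags become special tokens prepended to the amino-acid
# sequence, and the usual next-token objective is applied to the whole thing.
# Tags and sequences here are invented for illustration.
amino_acids = list('ACDEFGHIKLMNPQRSTVWY')
control_codes = ['<fluorescent>', '<membrane>', '<binding_site_x>']
vocab = {tok: i for i, tok in enumerate(control_codes + amino_acids)}

def encode_example(metadata_tags, sequence):
    # Any subset of metadata goes in front; missing metadata just means fewer tags.
    tokens = [t for t in metadata_tags if t in vocab] + list(sequence)
    return [vocab[t] for t in tokens]

example = encode_example(['<fluorescent>'], 'MKTAYIAKQR')
print(example)  # control-code ids followed by amino-acid ids, ready for a language model
```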
It's a super simple objective function still, it's just trying to predict what's the next one. And what's amazing, actually, and we just released this on Twitter today and on the blog post, is we analyzed it and found some super fascinating stuff. So there's protein folding, which is a computationally expensive, really hard problem. But what we found is that even though the model goes through sequences just one at a time, you can visualize the attention mechanism inside these transformer networks, and the attention mechanisms actually have a very high correlation to folding patterns in 3D. So the areas of the protein that, when it folds around, are actually close to another area, and then also different binding sites and so on, they're highly, highly correlated with the transformer attention. So I think there's a lot more there to find out and explore. Were the same mechanisms, like attention, that make language models work well the same things that really mattered for protein prediction, or was there any difference in the kinds of models that worked? So to be honest, these models are so large and we don't want to burn through a hundred million dollars to train ten of them. We just trained tiny versions of a transformer, and then we trained very, very few, one or two, of the 1.7 billion parameter ones with 230 billion or 70 billion or so protein sequences. I see. Sorry, we don't have a huge ablation table where we're like, we spent 100 million dollars, and one big paper that gives all the different numbers in a big table. These models are so large you really better not have a bug in there in the beginning and then realize it only later. But I guess, do these larger models work significantly better than simpler models? For sure. Yeah. This is really where neural networks shine. They have so much more expressive power. They can capture so many more different, non-convex, highly complex functions, and you need that. This is sort of where you couldn't do this with a linear model; the world is not linear. It would be a lot easier to solve all kinds of issues in medicine and so on if everything was some nice convex problem in biology, but it's far from that. So we really need the complexity of these very large models. Do you have any way to... Like I feel like GPT-2, one of the coolest things about it was it produced sentences that were so evocative, and you could decide that, OK, this thing is not getting long-range dependencies, but it's very fluent, you know? Could a scientist look at the proteins you generate and have some sense that these seem fairly realistic, or is there any way to measure that? It's super fascinating, right? Like what is the energy landscape in this discrete protein space that actually makes sense when you look at it? So biologists already do this. Like two years ago, I think, the Nobel Prize in medicine was given to a team that essentially randomly modified existing protein sequences and then just tested them out. You can synthesize them and see if they actually have certain properties, like you can try to make them fluorescent or not, and then see, like, how much can I randomly change and still have that property, or have even more of that property, which could be useful for drugs and drug development and so on. And so usually you don't want to stray too far away from it.
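A toy sketch of the analysis described above: compare a transformer's attention weights between residue positions against a binary 3D contact map (which residues end up close together after folding). Both matrices here are random stand-ins; in the real analysis the attention would come from the trained model and the contacts from known structures.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 50
attention = rng.random((seq_len, seq_len))          # stand-in for one attention head
contact_map = rng.random((seq_len, seq_len)) < 0.1  # stand-in for residue-residue contacts

# Correlate attention strength with whether two positions are in contact,
# ignoring the trivial diagonal (every residue "contacts" itself).
mask = ~np.eye(seq_len, dtype=bool)
corr = np.corrcoef(attention[mask], contact_map[mask].astype(float))[0, 1]
print(f"attention/contact correlation: {corr:.3f}")  # near zero here, by construction
```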
And then there are a couple of different metrics you can look at: as you generate a new type of protein that doesn't yet exist in nature, how likely would that be to be structurally sound at all? And in the paper we actually have different experiments where we show that there's a certain energy score you can compute that says, this would actually have a very low or very high energy, and hence this protein would not just disintegrate and fall apart, it would actually be structurally sound. And it turns out that compared to the random baseline, which is relatively easy to beat, we're so much better and create much more stable proteins that are more likely to actually work. That's so cool. I'm going to keep jumping around because I have so many questions I want to get through, but I'd love to hear about the language model that you came out with last year, CTRL, and what inspired you to make a new language model, like what it does differently than other options out there? Yeah, it's a great question. So CTRL is essentially a controllable language model where, instead of just saying, here's a beginning sentence, now just spitball, like randomly generate how that could continue, usually it would make more sense for us to try to create language technology that we have a little bit more control over. So we created these control codes that essentially say, for this sequence, but also given this genre, continue the sentence. So if you start with a knife and you say the genre is a horror movie, then the knife peeks through the door and a lot of crazy stuff is happening. But where you say a knife and a review, then it's like, oh, the knife cuts my vegetables really well, my husband loves using it in the kitchen, and blah, blah, blah. So that's the difference. You have more control over what it would actually generate. Control codes can also be used as task codes, and you can say the task code or control code is, generate the translation of this, and then it generates the translated sentence after, instead of just the next random possible sentences that might make some sense. And so at that point, this has been something I've been trying to work on for a long time with the Natural Language Processing Decathlon, decaNLP, and a lot of other projects. I think we're at that state now in NLP where we can try to just solve a lot of the standard NLP problems by having a single, large, multitask model where you have as the substrate a large, complex neural network structure. It almost doesn't matter anymore these days what it is. You could probably have a very deep, large stacked LSTM, now it would be a transformer, we'll probably come up with other versions of that, but some kind of large general function approximator, some neural substrate, and then the novelty is you try to train it to have all these different objective functions, different tasks. It gets better over time, and then you can get to transfer learning tasks, you can get to zero-shot abilities and so on. So that's been a dream since our first line of that work with Bryan McCann on contextualized word vectors, CoVe, which we trained back then, still with translation.
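The knife example a bit earlier can be made concrete with a tiny stub, not the real CTRL model: the only mechanism illustrated is that a control code is prepended to the prompt, so the same prompt gets continued differently. The continuations here are canned strings invented for illustration.

```python
# Toy sketch of control codes: same prompt, different control code, different continuation.
# A real model would sample next tokens; this stub just looks up a canned ending.

def generate(control_code, prompt):
    tokens = [control_code] + prompt.split()
    canned = {
        "Horror": "peeked through the door and the lights went out",
        "Reviews": "cuts vegetables really well, five stars",
    }
    return " ".join(tokens[1:]) + " " + canned[control_code]

print(generate("Horror", "A knife"))
print(generate("Reviews", "A knife"))
```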
Then ELMo took that idea and replaced translation with language modeling, which is even more clever because you have even more data that's unsupervised than you have with translation, which is, you know, sort of the biggest supervised dataset, like ImageNet for NLP. And then ELMo, of course, became BERT, with even more novelties on top of it, but still sticking to language models and taking these contextual vectors. And so when you have contextual vectors that can get easily fine-tuned on multiple tasks, then you have something like decaNLP, where everything is described as one task. Then you get closer and closer to that step of eventually just having a single model for all of NLP. And then my hope is that eventually the NLP community can work in a cumulative kind of way, where we have a CTRL-like language model or question-answering model that you can ask any kind of question and so on. Or you can even have just a general language model, but you ask it questions, and then the next words that come after the question should be the answer, if it really learned something about language and the world and everything. So that is an equivalent supertask of NLP. The long story short is, if we're able to do that, and every piece of research that we do actually makes an existing super model better and better, then we would all of a sudden have an explosion, I think, in progress in natural language processing. And we would stop saying, oh yes, this paper has a baseline and we're making it a little bit better, and then in the next paper we jump back to the baseline and make it a little bit better in a different direction. We improve our baselines from time to time, but all these papers do sort of one-off improvements of these baselines, versus every time somebody publishes a good paper, the model overall gets better, and then everybody will start directly from that improved model. So that's been my dream for the NLP field for a while. It does kind of seem like NLP is moving in that direction, doesn't it, with the big multitask baselines? That's right. And T5 and all these other large models. I'm super excited to see it. I think it's finally happening. It'll still take some time because, just like... About ten years ago, I had my first deep learning and neural network paper at an NLP conference and the reviewers still wanted to reject it. A lot of people were like, why are you applying neural networks to NLP? That's stuff from the '90s, it doesn't work here. And in the beginning of my PhD I had a lot of papers rejected. And I think part of it is that a lot of people built their careers and their knowledge and their academic standing and so on, on feature engineering. And so when you say, oh, you don't need to do feature engineering anymore, you just now have these models and they learn the features, it doesn't sound that great if you've done feature engineering for 10 years. And now we have the last 10 years or so of people doing architecture engineering, and they don't want to hear that the architecture doesn't quite matter anymore. It's now about the objective functions. And so let's ignore all these architecture engineering papers and just assume there's one very large, efficiently trainable neural network architecture, probably a transformer because it's parallelizable on GPUs nowadays, but it could be LSTMs or whatever. And we train this really large one. And now we become clever about the objective functions for improving that neural substrate.
Again, it will be a shift, and it usually takes the community a couple of years to make these shifts. Young people are jumping on it, and then people that are older and have been in the field longer will eventually, kind of through their grad students and so on, adjust and then embrace it and then start doing amazing work in that new area. So how does CTRL fit into that? It sounds to me like... Was that a new architecture, or was that really just adding control codes? It was mostly adding control codes to a large language model. That was kind of the main idea. And it fits into this as a way to unify. Basically, the way I see it is there are three equivalent supertasks of NLP: dialogue systems, language modeling, and question answering. You can cast every other NLP problem into any of those three, and you can map those three between one another. Like in dialogue, something happens and then you have to generate the answer to what the previous agent just said. And language modeling, you can also cast as question answering by asking a question, and then the words that should be predicted after that question should be the answer. So question answering and language modeling are equivalent. So we tried this with decaNLP, where we used question answering as the default framework, and with CTRL, it's the acknowledgement that if you start with a large substrate that can be trained unsupervised from a large amount of text, it's sort of the best single task to then transfer from and do multitask learning from. Do you treat the control codes differently than other tokens? Because I feel like I see a lot of examples where people do translation by just showing pairs, and then it's like their language models are just generating pairs. Is that the same thing, or is CTRL somehow doing things more systematically? So you do have... In some cases you can actually make those control codes be language themselves, right? So you could say, like, here's a question, now here's the text, generate the answer after you've read that whole thing. But you can also have control codes that are just like, task_1 is a control token, and then it will... What's surprising is, with these control tokens, the outputs will be very different, right? Like translation all of a sudden generating very different output with the same neural network architecture overall. It's pretty amazing how that works. It's amazing how that works. I mean, I can't believe it though. Like I remember when I was studying this stuff, it was like, you know, it was linguists that wanted to do it completely explicitly and rule-based, versus, you know, people doing machine learning. So I guess you sort of keep going up levels of abstraction. You know what's interesting? Sometimes these rules, which I used to so much discard... When you try to build a real system for a real company, you have a chatbot. And that company has, in the end, like everything, if you have the ability to make it into a chatbot, some API somewhere, right? Whether that API is like you click on these fields or you already have it as a program, it needs to be a structured, disambiguated, programming-language output at some point that fulfills actions like ""what order do you want?"" We go into our order management system and update this field and resend a new one that goes into some logistics center.
And so when you have these concrete chatbots for a company... I was always thinking, oh, it should just all be learned and then at the very end they generate some code and so on, but the truth is, companies sometimes want to have control. They want to say, yeah, maybe there were bad biases in my past training data, or maybe we changed a process and now we don't want to do it the way we used to do it, it's going to be a new process, or, in this country we have some regulations so we need to first ask this other question that wasn't in the training data from that country, and so on. So I'm surprised how often, when it comes down to real business and products, you still have to have these rules in there. A minimal sketch of that pattern follows. I'm curious, actually, so you've gone through this transition from mainly academic to startup founder to, you're at a big company, like a C-level executive. Have there been any other surprises like that, like seeing how businesses think about machine learning versus how academia thinks about it? Yeah. One thing I love is that I actually still dabble a little bit in all the other ones. And obviously we're still doing fundamental research now, but also now lots of product and stuff. But there are a lot of interesting different mindsets. In many ways, if you have a domain problem, and this is actually something you see even in research, if you work in just biology or you're just trying to solve one particular domain problem in a particular modality like language or vision, then it's rare... It's really hard for the people working on those applications to also find out new architectures. It's just a different mindset. You're trying to solve the problem. Like if you try to, for instance, help babies in the ICU or you try to cure COVID or something, you don't care if you can do it with naive Bayes or an SVM or some latent Dirichlet allocation or whatever the popular model at the time used to be, it doesn't really matter. You solved, like, cancer, or some specific type of cancer. And so it's interesting, you start to throw the kitchen sink at applied problems, and that's sort of still true even for applied engineering teams. They say, you know, by the end of this quarter and this sprint planning and so on, you've got to have a solution that works on some level. Whether it's the absolute latest and greatest that really squeezes out those 2%, that depends on the business model. Like for Google, it makes sense to spend a lot of time on AI because they clearly have certain AI metrics, like recommender systems for advertisement and so on, where an improvement in an AI metric results immediately in more revenue. That isn't the case for every AI problem and solution and product out there in the B2B world. Sometimes it's just, if it works, you make the same amount of money as if it works 5% better. What are the things that Salesforce cares about? Like what are the ML applications that are really important inside of your company? So there are a ton. There are roughly different groups, such as packaged applications that you can sell as is, like a chatbot application or opportunity or lead scoring. Some of these sometimes go into a second category, which is quality of life and user experience, where you just make the product a little bit better, and you have a lot of those... Wait, sorry, what would that be like? Like make the product a little better? For instance, you type in the name of a company to create a new lead object as a salesperson and it just finds the company's logo, just like, boom!
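The sketch referenced above is a hedged, invented illustration of keeping explicit business rules alongside a learned model in a chatbot: the model proposes the next action, but rule overrides (a regulatory question for some countries, a changed process) take precedence. All names and the rule itself are hypothetical.

```python
def learned_intent(message):
    # Stand-in for an ML intent classifier; a real system would call a model here.
    return "create_order" if "order" in message.lower() else "small_talk"

def next_action(message, country):
    intent = learned_intent(message)
    # Rule overrides take precedence over the learned policy.
    if country == "DE" and intent == "create_order":
        return "ask_consent_question"   # hypothetical regulatory requirement
    if intent == "create_order":
        return "call_order_api"         # structured, disambiguated API call
    return "continue_conversation"

print(next_action("I'd like to place an order", country="DE"))  # ask_consent_question
print(next_action("I'd like to place an order", country="US"))  # call_order_api
```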
And now it looks better in a nice table. This is not a feature you could get money for. Or the search functionality. Search is one of the most used features in most CRM software, but spending another billion dollars on improving search is questionable, because you'll make the same amount of money; everybody assumes that search should just kind of work, and you don't pay extra for it for the most part. So you have packaged applications where you clearly make a lot more money, like a recommendation engine in commerce platforms. We have one of the largest e-commerce platforms in the United States, which many people don't know, because nobody goes to Salesforce.com to buy their shoes, but you go to Adidas.com, which runs on Salesforce. And so there you have recommendation engines as really sort of almost an obvious kind of task. Everybody knows you should use recommendation engines in e-commerce, but those are sort of packaged applications that you can sell as is. Then you have these quality of life features. You have things like you want to improve your operational efficiency, like making recommendations for your own salespeople or learning how to work with your data centers more efficiently and things like that. And then we also have in the company a lot of platforms where we enable our hundreds of thousands of customers to build their own AI applications with their own data without us interfering. And so there you also have interesting problems, because you have to not just build one app, but you have to build an engine such that a lot of admins, with low or no code, can create an AI application, some prediction model, some recommendation model, some OCR model to read forms, from some complex form directly into digital form, which is surprisingly still necessary a lot of times these days. So there are so many different applications. That's why it's so exciting here. How do you even... Like in your team, how do you decide what to take on? Is it by research interests? It's a complex process. I'm wearing these different hats, and so on the research side, we go mostly for impact on the AI community as a whole. So that's one of our objectives: impact in AI research. Another one could be impact down the line, eventually, on products. So we have things that we work on in medicine, where we don't currently work in medicine, but maybe down the line that could be used. We have things that we work on like the AI Economist or ProGen, where maybe eventually the world will improve, but it's not really clear. So there's sort of pure AI research impact on the world and all our stakeholders and the community and so on, and then impact on real products. A lot of natural language processing research is surprisingly close. So you can do some fundamental research in semantic parsing, learning how to really disambiguate a sentence into a query that could be used to get the answer from a database. And that is fundamental research, but it's also pretty applied and could be used for Tableau and a lot of other exciting areas inside the CRM where people need to find answers in the database. So that is kind of the two different worlds on the research side. Then on the product side and the large engineering groups, it's very customer driven, and sometimes it's driven by what we think the future will be like. So we announced, for instance, at Dreamforce last year a first agent over the phone. So an agent that you can just pick up the phone and have a natural conversation with.
So Marc and I were on the stage showing sort of what that would look like. So that is obviously something that maybe customers aren't even thinking about yet, because they're not sure it's even possible. But we're working on those kinds of things because we think it will be possible soon, and we're now making it possible. Cool. All right. Well, we're running out of time. We always end with two questions that I didn't warn you about, but I'm curious what you'll say. So the first question is, what's an aspect of machine learning that you think practitioners are not paying enough attention to? I think now that AI has reached that deep impact level on the world, you really need to think about the biases in a holistic way: the systems, the people, the structures that are using AI for something. Are we thinking enough about the bias? And as AI has a bigger and bigger impact on people's lives, I think the bar needs to increase more and more. With a loan application AI that decides who should be able to start a business and so on, you really need to pay a lot of attention to the biases in the datasets, the biases in how those datasets are created by people, the hidden agendas and what the status quo is and so on. How do you improve the world in the end, versus entrenching it in the current system and just keeping the current system the way it is? And I think that's sort of something that a lot of practitioners still need to work on, and now also more researchers need to work on. Because even when we play around with, like, oh, that's just a cute little artsy research project, right? E.g. depixelization, it turns out there's another deeply rooted bias that is there and that gets exposed in it, and I think we should all work on that. Do you have any suggested reading material for people who want to get more educated on the topic, where you would point them? Yeah, for sure. I think Timnit Gebru right now is really one of the leaders in that area, and she has given this great tutorial at CVPR; the slides are online. There are a bunch of papers from a lot of other people. On our team, we also have Kathy Baxter, and she looks a lot more at the applied side of AI, making sure that AI systems are explainable, transparent, that you have feedback loops in them that people can give feedback to. When an important decision about them was made in an automated fashion and they think it's wrong, they're able to fix it and sort of escalate it to humans or improve that data. Making sure they're explainable, that you actually understand how it came about that this decision was made about you, and so on. So there are a couple of different things to make sure of. Even though it sounds kind of crazy, I think we need to even think about human rights when it comes to applications that we work on. And so I think Kathy Baxter has a lot of materials online, interviews and materials. We also have some trails on the Salesforce learning platform on ethics, and ethics in AI in particular. And then Timnit Gebru has a lot of great materials on research in AI and the systemic issues, as well as other concrete issues. Cool. Yeah. We'll put this in the notes, and totally agree. The final question: so you're coming from a research perspective, but you're at a company that does lots of applied machine learning. When you look at that path from taking a research thing to, you know, deployed inside the Salesforce product, what's the biggest challenge that you see in that process?
Like, what things do you think get bogged down the most? Boy, it's interesting. I feel like we're finally really getting into a groove and we're getting a lot of features out much, much more quickly than we used to. I think part of it is just that the two different sides, the pure researchers/research engineers/data scientists/data engineers, have a certain way they see where the complexity of deploying an actual AI product is, and then you have engineers. The truth is, though, that somewhere between 5% and 20% of an AI product is actual AI, and then somewhere between 80% and 95% is just relatively standard, but still very hard, software engineering. Everybody can nowadays quickly hack together a quick TensorFlow image classifier, right? It's like, oh, and you feel like after 10 minutes you're an expert and so cool and you're super smart and you know AI now and so on. But then when you actually want to deploy that in a large context, now you have load balancing, security, you have privacy, and you have all these issues. Now somebody in Europe, under GDPR, says, I want you to delete this picture. Now you need to retrain it. If that happens every day, are you retraining a whole huge model every day because somebody asked to take their data out of the thing that eventually fed your classifier? How do you update the classifier continuously? How do you make sure that as you update the classifier, if you've had something like FDA approval or HIPAA compliance, the new classifier is still compliant with all the various regulations you have? So there's so much complexity in the engineering and productionizing of AI, and that is sort of what a lot of people who are super deep AI experts often underestimate. Cool. Well, great. And great talking to you, Richard. Thank you so much for doing this. Pleasure. Great questions. It was super fun to geek out a little bit and go deep into some of these papers. Totally. Yeah. Thanks so much.",8760
+Zack Chase Lipton — The Medical Machine Learning Landscape,https://www.youtube.com/watch?v=zV-wd1iSSSk,3592,2020-09-17,"Maybe it's true that theory is somehow more philosophically interesting than just benchmark applications, than just empirical pursuit of methods. But the application is a different axis. I actually think that the applications are super philosophically interesting. I mean, they force you to ask... Because they actually ask questions that aren't just mechanical. You have to ask the normative questions. Right? The thing that I think is exciting about applications is that nobody told you in the first place what is worth predicting. That, by itself, convincing someone that this is actually a problem worth solving. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Zack Lipton is a professor of machine learning at Carnegie Mellon University. He has an incredible number of research interests, and it was actually hard to research all the papers that he's been working on, prepping for this interview. I'll give you a couple of topics that we might cover today: robustness under distribution shift, breast cancer screening with machine learning, the effective and equitable allocation of organs, and the intersection of causal thinking with messy data. He's the founder of the Approximately Correct blog and the creator of Dive Into Deep Learning, an interactive open-source book drafted entirely in Jupyter notebooks.
Couldn't be more excited to get into it. I have a couple of your papers that you flagged that I'd love to talk about, but kind of before then, I kind of wanted you to catch me up. I feel like last time I knew you, you were applying to grad school, and now you seem like a successful professor with a lab at a very famous school. What happened, Zack? Yeah, it's been a weird ride. So when we met, it was in San Francisco. And that was like ... I had already made this weird decision to go and do this tech thing and live in California for a year, get into grad school. But before that I was a musician, so it was even a bigger jump. I think it looks more planned or directed now than it was at the time. The guiding thing to get from being a musician to being a PhD in machine learning was just a recognition that I wanted to be in PhD. I had enough friends who were in the sciences that I sort of knew that maybe the sorting hat got it wrong or something at some point. And I didn't even know what modern machine learning was. It was really guided by a kind of, I knew I wanted to be in a certain kind of scholarship, and I wanted to be in a certain kind of environment. And I knew that meant going to grad school. And then sort of looking like, all right, I was an old man for a first person starting on a scientific career. So it was like ... I wasn't going to do a wet lab thing and spend 10 years learning how to like pipette because it was too late for that. And I had had just enough of a connection with computer science earlier that I knew that was something I enjoyed doing. But I don't know. It's kind of weird to look back. I mean, I think in terms of from where we met, which was I kind of knew almost nothing. I was just kind of wanted to go to grad school for machine learning. I think the biggest thing is, is that I entered the field at the moment of a really great leveling event. So the sudden rise of deep learning was an unexpected thing. And I think it would be an exaggeration to say it completely wiped out people's skillsets or whatever from before then. But it certainly opened up a path in research, where at least the next two, three years of steps in that direction, or a good chunk of them, didn't really require that you were like ... If things were just progressing normal science, and it was like kernel machines were dominating, for me to get to the point where I was a world leader in understanding nonparametrics or something, that wouldn't happen in like three or four years. But entering a field where suddenly everyone is doing deep learning and there was kind of like a wild west type environment made it very easy to sort of pick an area, say ML and healthcare, and very quickly be at least like ... Now, the new generation of technologies, be one of the leaders applying deep learning in that. So I think I got lucky that I sort of entered at that moment of transition where it wasn't so disadvantageous that I wasn't an expert in ... I wasn't a great engineer, and I didn't necessarily have all of that mathematical background, but I was able to sort of ... One advantage of it is I didn't have a lot of commitments. So I wasn't committed to a set of methods that I had invested years in reputation and getting them to work. So I could be kind of nonpartisan about it and say like, ""This is clearly a thing that's happening, and I have no sunk costs, so get in there."" That's really cool. It's actually kind of inspiring. I like it. What was your initial research on when you got to grad school? 
What were you looking at? I was working on healthcare problems. I had had some personal health experiences that were pretty devastating earlier in life. And I think that was just sort of always a motivating thing of, ""Could we be making a lot of these kinds of inferences better that guide medical decision making?"" It still is a kind of overriding, organizing motivation in my work. My research is a little more diverse. I don't just do the, ""I want to grab things and get empirical results on, say, a specific medical dataset."" Although, I do have a bunch of research in my portfolio that is applied to medical work, but also the motivated kind of underlying theoretical and methodological problems. But that was how I started PhD, was working on medical stuff. I wrote a statement of purpose that I think caught the attention of some people, like UCSD, which is where I ended up doing my PhD. There's a division that does biomedical informatics, and there's a computer science department. One's in the med school, the other's in engineering school. And I think they had been talking about maybe getting a joint student at some point, or someone who would be funded on one of the medical informatics training grants but be a student in CS. And they were looking for someone like that. What I was hired to do essentially was to work on healthcare problems, but I kind of just sort of ... I started with that motivation and looking at what people are doing, but I was sitting in a computer science department and watching what's happening with machine learning. So for example, I suppose the first problem I worked on was something in text mining. So it was medical articles. And we were doing massive multi-label classifications. So all the medical articles that get indexed by the NIH are tagged with some subset of this large controlled vocabularies. Kind of enables things like systematic reviews of literature. And so just a simple ... Like back when we were using linear models. And the challenge was that it was 27,000 classes, and we're trying to predict them all and do it in an efficient way. And now it seems kind of quaint because it's like ... You train language models with like billions of parameters and vocabularies that are like 300,000 words and it's not that big a deal. So I started working on that, but I was seeing what was happening in deep learning. And I think the first kind of bigger break that wasn't just a kind of minor paper was, we were watching everything that was happening. Convolutional neural networks where maybe the thing that were catching the most attention 2013, '14. But I was interested in a lot of these problems that had more sequential structure, so I was getting medical time series data. Like people are admitted, there's a bunch of measurements, they're getting updated over time. And so I started paying attention to natural language processing, what was happening, because that's another problem on a kind of sequential structure. And I was seeing things like these papers in 2012, '13, '14, that like [inaudible 00:07:27] gave, and other people like that were doing with language modeling and seek to seek type things. And you start thinking, ""Are these methods sort of limited to these kind of neat, ordinally, kind of sequenced things like language? 
Or would they also work for things like messy multi-variant time series data that you have in clinical settings?"" And so Dave Kale, who I mentioned earlier was the guy that they tried to recruit, I had actually met him when I was starting PhD at UCSD, actually at machine learning for healthcare. One of the first years of that, when it was still ... It wasn't even a conference at the time, it was like a symposium. And so we got together, this is like second year of PhD, and we kind of had this idea of ... It wasn't obvious at the time. Now, anything that looks like a sequence, people throw an LSTM at the time. But at the time, was really only making headway popularly in language. And a little bit maybe on top of RNN combinant type things like on top of video or stuff like that. And so we were interested, ""Can we do much better than kind of status quo at predicting things like length of stay mortality, recognizing diagnoses based on ..."" And so you have these time series where the added complications are you have a bunch of different variables, some of them are missing, they're not observed at some fixed interval on the wall clock, they're observed at different times. If you try to re-sample to make a statistic of the time series that's reflective of a fixed wall clock time delta, then you wind up with missing data that's not truly missing, but it's missing as an artifact of the sampling frequency. It wasn't observed in that window. So then what do you do? How do you impute it? Do you carry it forward? I guess you have a lot of windows where nothing happened? Yeah, yeah, yeah. Right, exactly. Say your heart rate's measured continuously automatically by the equipment. However, the coma score is recorded once per hour by the doctor when they make the rounds. And then some serological result, maybe it's checked once per day or maybe some days it's never checked, or something like that, you know? Well, if you choose that time interval that's somewhere in the middle, like hourly, and you have this one thing that you're measuring that's happening multiple times inside a window, this other thing that's only happening once every like seven windows. I mean, an alternative way that you could represent it is you could just say every measurement is a ... You don't have the time tick for the RNN correspond to a fixed delta on the clock, but you can make it correspond to the observation and say something like, ""Add as a feature. What is the time lapse since the last observation?"" That's a little bit like those event based representations that they use for music generation and stuff like that. In our case it didn't work as well. I mean, I'm always curious ... It's funny, we've talked to a whole bunch of people from different angles in the medical field, but can you give me a rundown of the current state of the art in ML and medical stuff? Like what are the most impressive results that you've seen recently? So there's a bunch of slam dunk results, I think. I mean, you have to divide up the categories of problems. I think a lot of people ... You see a lot of the whatever the public think pieces about ML and healthcare, and they just kind of slop everything together. And it's just like, the AI is making decisions, and you'll have an AI doctor, and is it better than a regular ... It's kind of just the way that collapses doctorness as to a single task. Sure. I think the reality is, you have a whole bunch of different tasks. Some of them are really clearly recognition problems. 
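A small sketch of the two representations discussed above for irregular clinical time series, with invented numbers: resampling onto a fixed hourly grid with carry-forward (which creates "missing" entries that are really artifacts of the sampling window), versus keeping one row per observation with a time-since-last-observation feature, in the spirit of the event-based representations mentioned.

```python
observations = [  # (hours since admission, variable, value); all values invented
    (0.0, "heart_rate", 92), (0.5, "heart_rate", 95),
    (1.0, "coma_score", 14), (2.2, "heart_rate", 101),
]

def resample_hourly(obs, variables, hours):
    # Option 1: fixed wall-clock grid; unobserved variables carry the last value forward.
    grid, last = [], {v: None for v in variables}
    for h in range(hours):
        for t, var, val in obs:
            if h <= t < h + 1:
                last[var] = val  # most recent value inside this window
        grid.append({v: last[v] for v in variables})
    return grid

def event_based(obs):
    # Option 2: one row per observation, plus the time elapsed since the previous one.
    rows, prev_t = [], None
    for t, var, val in obs:
        delta = 0.0 if prev_t is None else t - prev_t
        rows.append({"var": var, "value": val, "hours_since_last_obs": delta})
        prev_t = t
    return rows

print(resample_hourly(observations, ["heart_rate", "coma_score"], hours=3))
print(event_based(observations))
```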
Like it's a pattern recognition problem, and the environment around that problem is so well understood that if you solve pattern recognition, then you know what to do with the answer. So you don't have a real policy problem or a decision making problem, you just have a ... I put in this things like ... Now I'm going to get angry letters from, I don't know, some specialist that I'm saying they're automateable or something. But I think the things that are most amenable to this are the results like the diabetic retinopathy, where they take the retinal fundus imaging and they're able to predict whether or not someone has retinopathy, and do it, say, as well or better than a physician can just by looking at these images. This is one of those things where the doctor knows what to do if they're a hundred percent sure about the diagnosis. If you could just do the diagnosis more accurately, it's good. And then you know what to do. And you do the diagnosis here purely from an image? So it's essentially an image classification test? Right. Exactly. Things that sort of just reduce to, ""Hey, it's a pattern recognition problem. That's all we're doing. That's all the doctor's doing."" Those things you can ... Pathology, I think, has some of these, like diagnosing things based on microscopy. One of the best papers I saw on machine learning for healthcare in the first year that it was a publishing conference is people said, ""Hey ... "" It turns out they were attuned to the climate. They were actually writing from Uganda, and were ... The paper's very straightforward, but the problem was ... The A plus part of this paper is how well motivated it was. It said, ""Hey, there's ..."" Three of the biggest maladies in Africa were tuberculosis, malaria, and intestinal parasites. These things are diagnosed based on basically pattern recognition by human doctors looking at microscopy, like microscope images. Africa, at the time, as it was arguing in the paper, they didn't have nearly enough technicians to be able to give timely diagnosis to everyone. And I think at the time they said something, it was some to do with like ... Because it's much easier to diagnose ... Or it's much easier to donate a microscope than a microscopist. So there was a situation where there were more microscopes than there were technicians on the continent. And basically, it was like, if you just do pattern recognition really accurately, you can ... And you can even avoid a lot of the pitfalls that normally plague machine learning. Like you could standardize the equipment, just send everyone the same damn microscope, the same phone camera for taking the picture, et cetera. So they train a simple combinant, there was not a lot of like ... You didn't need to do anything super novel methodologically, and you ended up getting like 99% accuracy on doing this four-way classification among these [inaudible 00:13:40] done. This is an important problem. You can imagine shipping that tomorrow. Not really tomorrow, but you get the idea. Does that really work? I see a lot of these kinds of results, and I wonder, do they really work or is it somehow a more toy version of the real problem? Right. I mean, I think that's almost always a concern when you look at machine learning results, right? Because the results that you see in a typical ML paper almost always on a sort of randomly partitioned holdout set. So you're always worried about basically, ""Hey, I've ... 
"" Everything in the paper is sort of conditioned on the faithfulness to that idea assumption. That my training data and data I'm going to see in the future really can be regarded as independent samples from the same underlying distribution. And that's almost never true in practice. And the question is, is this true in a way that just completely bungles up everything you've done? Or is this ... So an example of where there's a huge discrepancy, is you have people saying that we have human level speech recognition. And then if you ever actually use your speech recognition, it's really clear that it's nowhere near human level. So what it means is, on the training corpus, if you randomly partition it, and you only look at the maybe accuracy on catching the ... Actually, I take it back. They're not looking at phoneme level error rates. They do look at word- I take it back. They're not looking at phoneme level error. They do look at word error rate at this point. But you get the point. It's like, if you make this really strong assumption that the training data is... And people confuse these because they use the same word. They say ""generalization"" in both cases. But one is the extrap... Or maybe what you might better, rather call ""interpolation"" than ""extrapolation"" of, ""Do I generalize from the training set, to samples from the exact same underlying distribution?"" Yep. The other is like, ""Can I tolerate the sort of perturbations and distribution that are assured to happen in practice?"" And so I think this is the thing that people deal in a really clumsy and kind of ad hoc way with right now. And a lot of my more theoretical and methodological research is about, what are actually proper sound principles according to which you can expect to generalize under, perform under, various shocks to the data generating distribution. So then I want to get to that, but I feel like I took you off on a tangent for no reason there. So, just going back to- You take me on a tangent, and I'll oblige. I appreciate it. But sorry, the other medical examples that you think are impressive. I think you were laying out like an ontology of it. Right. So I think the retinal fundus imaging, I think there's that long pipeline of productionalizing things in clinical trials, and I'm not actually up to the minute on where those are in that process. But that would be stuff that I'd be really confident would see it to production somewhere, if only as an assistive tool that like, ""Hey, if the doctor disagrees with this, get a second opinion."" Yep. So that stuff I think is really out there. But then you see the other things people are talking about, people started talking about management, conditions, decision-making. And they started training models to do things like predict what would happen based on past decisions or whatever. Now this stuff, it gets way, way, way funkier. Or all this kind of stuff that has a flavor of... There are maybe two things that people do. One is estimating conditional probabilities and pretending that they're estimating treatment effects. And they're just acting as though knowing probability of death given this, and death given that, actually is giving you really deep insight into what would happen if you intervened. Probability that someone dies given that they had a treatment is very different from probability that someone dies given that I intervene and give him that treatment, when in the historical data, this person always would have received a different treatment. 
So I think you have that kind of work, where there's a huge gap between the kinds of things people are trying to say about how... You have two sides: people who really understand causality and therefore really measure it and are conservative about the kinds of claims they're making, and then other people putting out associative models, and acting, and writing in a way that seems to confuse whether they're associative or actually causal models, in terms of the kinds of decisions they could plausibly guide. Or you have sometimes people doing things like off-policy RL, where you look at things like sepsis management, or whatever, and you try to say, ""Well, okay, can I fit some kind of..."" It's the same as the RL problem, like I've observed a bunch of trajectories sampled from one policy, and then I fit a model, and I make an estimate of what average reward I would have gotten under this alternative policy. But being able to make that kind of statement is still subject to all kinds of assumptions that you need in causality. Like that there's no confounding, that the past treatment decisions are not actually influenced by any variables that you yourself don't observe that also influenced the outcome. So all of these kinds of things, when people start talking about guiding decisions, making better treatment decisions, inferring all these kinds of things from observational data, I think there's a huge gap between the way people are talking and getting things into practice. But maybe those are the very most important things to actually be working on. And then you have the easily cordoned-off ML pattern recognition problems. Like just, ""Can I look at an x-ray and say, 'Is it pneumonia or not?' Can I look at a mammogram and say, 'Should they be recalled or not for diagnostics?'"" And so where does this time series analysis stuff that you were talking about in the beginning fit into that? Is that at a point where it's a tool a doctor could use? For example, the first big paper that we did on this is the one we published at ICLR, which is Learning to Diagnose with LSTM RNNs. And so there we're feeding in the time series and predicting which diagnoses apply to this patient. So I think you could paint a story that's not totally crazy about how this could potentially be useful. And one example would be, ""Hey, I have a new patient. There's some kind of emergency, I have the patient, I have them hooked up. I'm recording data. If I'm not sure what the diagnosis is, it would be nice to be able to have a short list."" So that's part of how we evaluate. I could look at what the machine thinks are the 10 most likely diagnoses, and I could say, ""Okay, I'm going to make sure that I include these things in the differential,"" or something. It would be some kind of sanity check, like you're using the machine as a wide pass to just make sure that you're considering the right diagnosis. Is that actually useful directly, like in its form? You know what I mean? Like, there's a question of, ""Could that in general, that kind of idea, work, and is this sort of maybe a proof of concept, that it's plausible?"" I think we can maybe make that kind of argument. But in terms of, for the specific cohort, like for the patients in the ICU, is this really something where what we did is directly useful? I think you have to really lack humility to go out there and just say, in an unqualified way, this is actually useful in practice. I think probably not.
Like, I think for a lot of those patients, basically, we're able to demonstrate this technology is capable of recognizing these facts about these patients. But in reality, the diagnoses for a lot of these patients was already known. We're just showing that we can figure out what it was from certain trajectories, certain traces, certain measurements. If the doctor already knows the diagnosis, what do we really do to improve care? And I think this is how my research has evolved. I started off maybe asking a lot more of these, which is dangerous thinking with representation learning. And like, ""Can we do anything useful with these types of weird looking data?"" You know, the standard thing you remember from the early 2000s or whatever was like, ""Always find a way to represent whatever you're working with, as like, a fixed length vector. And then feed it into like, menu of [inaudible 00:21:49] learn models, or whatever. And see what comes out."" It was exciting to say, ""Could we actually get signal out of these varying lifetime series, with these weird- missing those patterns, and whatever."" But you know, at some point, okay, like the representation learning thing has happened, and we know that we can do this. And there's less things that are truly exciting there. Because we sort of know how to... We have a good set of tools, between sequence models, and [inaudible 00:07:17], and graph convolutions, et cetera, for representing various sorts of exotic objects. And that's no longer, maybe to me, the most exciting thing. So the most exciting thing is, ""Okay, we can do function fitting. Let's just say we can do function fitting. Let's say we even believe that we've solved function fitting. What's next?"" Like, that doesn't get us to the AI doctor. That gets us to maybe we've solved retinal fundus imaging. But for the most part... Here's another problem, to just poop on my own work a little bit more. And one thing that we often do, is we make these statements about, what is human level performance on some task. But we often don't think about the wider scope. We're sort of myopically focused on... Like, in ML, you're really told, I've got my inputs, I've got my outputs, I've got my loss function. And then the room inside there? That's where you dance. Right? But think about the diagnosis problem. This is an example I like to give my students, is, the way we cast the diagnosis problem in ML is, given all this measured data, can you infer more accurately or as accurately as the human, what is the applicable diagnosis? But was that ever the hard part? The extreme example is like, if the doctor gives you the test for Lyme disease and the result is positive, the fact that the machine can more reliably look at the data that contains that fact and say, ""You have..."" That's an extreme example. But you get the point. It's like, given that you were already routed to the right kind of care and had the right measurements done and whatever, maybe the machine is good at doing inference about what you have, but maybe that was never the interesting part, that was never the hard part, that was never the part that really demanded that you need a human in the loop. The hard part was seeing a patient. You have no data about them. And you have to make these hard decision problems. Decisions are not just about treatments. There's also decisions about information revelations. That's something we focus on a lot in the lab. Now it's these weird problems where the decision is what to observe. 
Like, I want to estimate... I want to ask them to figure out what is the best drug to treat some patient. I've got a bunch of people coming in. I can run some tests, but I can't run every test for every patient. I could try some treatment, but I can't run every treatment for every patient. So like, if I were to cast this kind of problem... And you can make it really abstract. You could just say, ""I've got some kind of set of variables, they're related by some causal graph. In every time step, you get to observe some subset of them, and you have some budget that constrains which ones you can intervene on. But the point being that it's like, the set of data you observe not being taken just, like, by god, given to you as something that you take for granted, but rather, widening the scope of what we consider to be our jurisdiction as people thinking about decision-making and automation. Well, I'm obviously I'm a big fan of that area of research. Because I do think in practical applications, you do actually have some control over those things. Like what data you want to collect and how you want to collect it. And I do think it's a messier research problem. But probably more directly useful in a lot of cases, just because the function fitting stuff is so well studied, relative to the impact that it can have. Yeah. Sometimes, I think people have a... You've seen this before. You were Stanford math or something? You've seen the kind of weird hierarchies that people form within a discipline, this idea of like, ""Okay, there's the mathematicians, are on top of the physicist, are on top of the chemists, are on top of the biologists, are on top of the applied"" whatever, whatever. And this thing happens in ML a little bit with theory and application, where people get snooty. And I think one thing that's weird is that there's two axes that get collapsed there, of theory and application, or mathematics and empiricism. Like mode of inquiry versus method versus real world. And I actually think that maybe it's true that theory is somehow more philosophically interesting than just benchmark applications, than just empirical pursuit on methods. But the application is a different axis. And I actually think that the applications are super philosophically interesting. They force you to ask... Because they ask you to ask questions that aren't just mechanical. You have to ask the normative questions, right? Like, the thing that I think is exciting about applications is that nobody told you in the first place what is worth predicting? That by itself, convincing someone that this is actually a problem worth solving. I was just reading one of your papers that you pointed me to, on, essentially, collecting more data. The way I would describe it is, it's about collecting more data, to get the model, to learn the things that you want or the connections you want, versus the sort of spurious connections. You had a good example of models predicting seagulls because they see the beach. You make this point that's evocative, of like, we assume that that's bad, but it's kind of hard to articulate exactly what's bad about that. Because it hurts you in generalization maybe. But if it doesn't hurt you in your data set, it's probably harder to distill what's bad about that. Right. You have all these papers out there that are just saying the model's biased, or the model depends on superficial patterns or spurious patterns, or whatever, without any kind of clear sense of what technically do they mean? 
And what we get out of that is trying to say, ""Here's something that I think causality has to offer."" I think a lot of people talk about causal inference really focused on the wrong thing. Thinking like, ""Is it useful, or is it not useful?"" Like, ""Can I take the Pearl machinery and go apply it on the real data, and estimate, and get the number."" And economists are, I think, more focused on that. Like, ""Can I get the number? Can I estimate it?"" But I think one thing that's nice about all this is Pearl's perspective. And I think that is really important. Causality is not just useful because you can actually estimate the causal [inaudible 00:13:39]. It's important because you can coherently express the kinds of questions that you actually care about. And at least within that, you can have a way of making coherent statements about things. So in this case, it gives us the vocabulary, of to say, ""In what sense is it wrong to depend upon the beach?"" When I'm saying this as a seagull. It's that it's not what causes it to be a seagull. Or an example that I like a lot of times is like, ""Why is it not right to base lending decisions, for whom you give a loan to, on what shoes they're wearing?"" And so part of it could be that you know something about how shoes relate to finances. Like, you know something about the structure of the universe. And you're able to think in your head, ""What happens if I intervene on your shoes?"" You know, if I take someone and I intervene on their shoes. Because you know people can intervene on their shoes, right? If everyone who wears oxfords to the bank gets a loan, and everyone who wears sneakers doesn't, people will intervene and say, ""Is this a reasonable procedure?"" One reason why I say, ""This is why I want to depend on this or not on that,"" is to say, ""What would be-"" I can do this counterfactual simulations and say, ""What would happen, where I to intervene on that? Would this change your ability to pay? Would this change the applicability of the label and the image?"" So I think for us, the big insight is just to think of it kind of coherently. So I think for us, the big insight is just to think of it kind of coherently as think of like semantics as actually sort of being causal in a way, like this was what causes the label to apply. Then it becomes maybe well-defined, right? Because, I mean, the benefit that we have in our paper, the learning the difference that makes the difference paper, we actually have humans in the loop. So we're saying, ""Hey, this is something that may or may not be actually identifiable from the observational data alone, but it's something that we can get via the annotators."" They're revealing to us ... I read this example about genre in movies, right? So if you train a classifier to predict sentiment on IMDB movie reviews, you find that top positive words, or you do something like just train a linear model and look at high magnitude positive coefficients vs. negative, the high positive ones would be like, fantastic, excellent, whatever. Negative ones would be like terrible, awful. But the positive ones also have romance, and the negative one also has horror. You're like, ""That's wrong."" Why is it wrong? It's because then Jordan Peele comes out of nowhere and starts making all these great horror movies, and your model's inferring that they're bad because it's kind of depending upon this thing that the signal is not durable over time. 
I was kind of thinking in that example, though, that I think romance movies are generally better than horror movies, and maybe the average human agrees with me. So there is some sort of... Right, but that's an associative statement, right? You're saying they are generally better, and that actually does seem to be what the general public agrees with, right? The problem isn't are they generally better? It's does it have to be that way, right? Could you imagine a world in which tastes shift and the talented movie makers really shun romance movies and they become bad? I mean, so there's a sort of embedded assumption here. It's something that we're looking into a lot now, and for anyone in the audience who's really interested, there's a lot of great work by a scholar named Jonas Peters, who's maybe more of a theoretician, but approaches these problems. There's questions about... Partly, one way of motivating this is you think about robustness out of domain. You say, ""When I go out into the rest of the world, is it always going to be true that romance is good and horror is bad? If I go to a different culture, do I expect that to hold? If I move to a different state, do I expect that this is the durable part?"" So one kind of assumption here is that the things that cause the label to apply, that this relationship is actually stable. So you can imagine that the things that actually signal positivity versus negativity in a document, that this is relatively stable over years, but there's a complicated relationship in the background that influences, is the perceived sentiment positive? Is the movie quality high? What is the budget of the movie? Right. What is in vogue? What are the houses spending money on? What are the publishers saying about what's getting distributed, whatever? But these things are all changing. But the causal features are... You can think of it as, if there's a structural equation that says what is the perceived sentiment from the text, that thing is actually relatively stable over time compared to these other features. That's part of our empirical validation. So we have this model, right? What we essentially get people to do is to rewrite the document. They're told to make a sort of a minimal edit, but it should alter the document such that it accords with the counterfactual label. So it was originally a positive review. We say, ""Edit the review without making any gratuitous edits such that it is now a negative review."" When they do that, you wind up with a new dataset, where for every original review that had horror in it and was positive, now there's a sort of bizarro counterpart, and it still has horror in it. The reason why it has horror in it is because of the instructions. The instructions said, ""Don't make gratuitous edits. Don't change facts that are not material to the sentiment."" So this is something that we can argue about, whether it's actually statistically possible to have disentangled that horror is or isn't a causal feature without that intervention. But once we have this document, we say all the horror movies still contain horror, but their label has been flipped. All the romance movies still contain romance, but their label has been flipped, because other parts of the document, the ones that actually needed to change in order to flip the applicability of the label, have been changed. So if you train the model on the counterfactually revised data, you find that the coefficients flip.
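To make that mechanism concrete, here is a minimal sketch of the coefficient comparison being described, assuming a simple bag-of-words logistic regression; the file names and column names are hypothetical stand-ins, not the actual released counterfactually augmented data.

```python
# Hypothetical sketch: fit a linear sentiment model on the original reviews,
# the counterfactually revised reviews, and their union, then compare which
# words get the largest weights. File and column names are made up.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

orig = pd.read_csv("imdb_original.csv")    # columns: text, label (hypothetical)
revised = pd.read_csv("imdb_revised.csv")  # minimally edited copies with flipped labels
combined = pd.concat([orig, revised], ignore_index=True)

def top_words(df, k=10):
    """Return the k most positive and k most negative words by coefficient."""
    vec = CountVectorizer(min_df=5, stop_words="english")
    X = vec.fit_transform(df["text"])
    clf = LogisticRegression(max_iter=1000).fit(X, df["label"])
    vocab = vec.get_feature_names_out()
    order = clf.coef_[0].argsort()
    return list(vocab[order[-k:]][::-1]), list(vocab[order[:k]])

# On the original data, genre words like "romance" and "horror" tend to show up;
# on the combined data, those spurious features should wash out.
for name, df in [("original", orig), ("revised", revised), ("combined", combined)]:
    pos, neg = top_words(df)
    print(name, "| positive:", pos, "| negative:", neg)
```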
So excellent and fantastic are still positive words. But now horror is also a super positive word, and terrible and awful are still negative words, but romance becomes a really negative word. The cool finding is, if you combine these two datasets together and train on them, they kind of wash each other out. So you find that all of the things that look like they don't belong on these lists of important features actually seem to kind of fall off. So we're dealing with causality here in maybe a more gestural way. We're not using the mathematical machinery of graph identifiability or anything like that. But we are getting an interesting kind of really suggestive result on real data. When we look at it, just to that last point that we were talking about with generalizing out of domain and are the causal connections durable, one thing that we looked at in the camera-ready version of that paper is we say, ""Okay, we trained it on IMDB. Let's now evaluate it on Yelp, Amazon, et cetera, et cetera."" When you go to those other domains, the model that was trained on the counterfactually augmented data, which is the combination of the original and the revised, does much better out of domain. That's just not guaranteed to happen. The supports are not shared. There's a lot of funky things happening statistically here. But what I think is suggestive here is it's like it does sort of agree with the intuition that you say on movie reviews, horror versus romance is an important part of the pattern. That's a real clue. But once you start looking at Amazon electronics or something, that's no longer actually maybe a durable pattern. Someone's like, ""Oh, my Discman was such a horror"" or something. Well, I think what I really liked about that paper was sometimes I feel like, at least for me, some of the highly theoretical papers kind of point out problems, and they're kind of hard for me to even engage with, because I don't sort of see the practical effect. But you have actually such a simple mechanism proposed here that actually worked in your case, which I thought was super cool. I've noticed in my 15 years of working with ML teams, a lot of teams naively intuit to do things like what you're saying, and they usually feel bad about it. They feel like they're kind of doing this weird manipulation of the data to try to get it to generalize better by literally often rewriting the text in structured ways. So I don't know. I just really enjoyed the... It's a cool paper with a cool theoretical motivation that I think is really important, right, of kind of eliminating different types of bias and making these generalize better, but then also an interesting practical way of doing it. It's reminiscent of active learning techniques and things, but more interesting. Cool. Yeah, thanks. It was fun to write it. It was scary for a minute, though, because we were asking these workers to do this weird kind of thing and not sure of the results. It was sort of like a little bit of coin relative to- Sure. ... the pot of discretionary funds at the time. So it was sort of like there was this moment of, ""Well, what the hell are we doing?"" But yeah, it was nice that it worked out. I mean, I think that's just mainly one of the differences between a sort of... Not to get into academia versus industry culture wars, but I think something that academia done right affords you is that it's not like we need to get the product out or something.
We have this time after, and it's like, okay, you have that intuition that this mechanism might be interesting. But the next step isn't just do it or not do it. It's like the ability to have a PhD student spend a lot of time, to have kind of arguments about this for a couple months of, ""How do we want to do this?"", agonize over the experiments, kind of go back to... Let's say we drew a toy causal model in our heads. What does this correspond to? So we have a lot of followup work coming from that now, but the fact that you get that, for somebody, it's their full-time job for a year, is thinking really hard about a problem. You can get from, ""This is something kind of wacky. Maybe let's try it,"" and then call it... versus ""Okay, now this is your full-time job for a year, is we're going to think really hard about this one problem."" Yeah, yeah. That's super cool. I was kind of curious. So I was also looking at another recent paper that you pointed me to that was a little bit kind of harder for me to parse, algorithmic fairness from a non-ideal perspective. Could you describe what you're doing there? Yeah. So this is a paper with... So I actually have a postdoc in the philosophy department now. So he's working with me and David Danks, and this paper is really about... I guess in some sense, it sort of touches on the high-level theme of identifiability, which is... There's a lot of well-founded concerns. If you're going to have decisions automated, these are decisions that in general are addressing problems that are sort of ethically consequential, whether it's bail decisions, lending decisions, hiring decisions, mediating the flow of information, any of these decisions. All the normal questions and concerns that you have about fairness and equity and justice continue to apply. I think as machine learning has gotten widely deployed, people have sort of become more and more aware of this. I think in 2015 or 2016, I was starting a blog on this. I didn't even know there was this community out there of people working on it. There weren't conferences like the Fairness, Accountability, and Transparency one, and whatever. Now it's kind of blown up, and it's blown up for a few reasons. But I think there've been a few pivotal things that caught people's attention. One, there was the hiring screening thing that was filtering out resumes from female candidates. Probably the biggest thing that caught people's attention was the ProPublica article about machine bias. This is talking about recidivism prediction models. It's predicting who will get rearrested if released on bail. So you have these systems, and suddenly, basically, the claim is these systems are being used to guide sentencing decisions or maybe bail release decisions, and they're biased against black people. This is obviously a big problem. Then immediately there sort of arose this crisis of, ""Well, how do you quantify that? What is the quantity that says there's bias?"" So someone says, ""Well, let's compare the false positive rates, or compare the false negative rates."" You have this whole kind of literature. ""Let's compare just the fraction of people that are released on bail among all defendants."" You say, ""Well, the distributions of crimes among defendants are maybe not the same."" You have these metrics that are based on thresholds, but you're not necessarily considering all aspects of the distributions. People come back with these kinds of criticisms.
There sort of emerged this whole community that spans sort of algorithmic fairness, which is looking at these kinds of problems and trying to say, ""What are formal ways we could define fairness?"" So you might say the model should functionally behave equivalently regardless of what your demographic is, fixing all your other data, and then the criticism against that is you say, ""Well, that's meaningless, because if you withhold gender, but you have access to, say, all of my social media data or you have access to some sufficiently rich set of covariates, someone's gender is captured there. So what does it mean to say just that you didn't explicitly show that bit in the representation? If the information's there, you have it. So what does it mean to say it didn't impact your decision?"" So there's this whole kind of line of work that's sort of trying to express this problem formally, and they're trying to express it in a world where everything is sort of defined statistically, a world where basically what we know is there's a set of covariates, which are just some numbers, some distribution. We'll call it X. There's a demographic indicator. It's like, ""Are you in Group A or in Group B?"" There's the predictions by the model, and there's the ground truth. This is sort of like now trying to say, ""What are the kinds of parities that we want to hold?"" So maybe I say, ""I want an equal fraction of the population classified as positive, whether they're in Group One or in Group Zero. I want the model that doesn't actually look at the demographic. I want them to have the same false positive rates. I want them to have the same false negative rates. I want them to have the same [inaudible 00:13:38]."" So people propose- Do you think you could put these in the... Sorry. I'm trying to connect these to, as you say, false positive, false negative. I'm just imagining the cases. I mean, can you say these in more real-world cases- Sure. ... so people don't have to make that connection? Right. Actually, this is sort of the focus of a lot of our critique. You could just describe the world in those terms and zoom out and start talking about various kinds of equations, and you could say a whole lot of things that seem intuitively plausible or reasonable, like, ""I want this to be equal to that."" But what's missing from this whole thing, so when people talk about Word2Vec and say that Word2Vec is biased, Word2Vec is discriminatory, Word2Vec is racist, what does that mean? What actually is even the category of object to which these statements apply? You kind of realize really quickly that we've sort of abstracted so far away from the problems in that description that we don't have the relevant facts to say what is fair. So an example would be, if a model is being used to predict whether or not you're going to commit a crime, and being falsely predicted as going to commit a crime means that you get denied bail or something, being predicted positive is really bad. If the model is trying to predict who, conditional on being hired, would be likely to get promoted, and it's using this to guide resume screening or something like that, then getting predicted positive is good. So in one case, maybe you'd be concerned about false negative rates. Like someone who really has the skill level being denied the opportunity for the job.
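As a concrete illustration of the parity criteria just listed, here is a minimal sketch of how one might compute them from a model's predictions; the toy arrays are illustrative and not drawn from any of the systems discussed.

```python
# Illustrative sketch: compute the parities described above from ground truth y,
# predictions y_hat, and a binary group indicator a. Toy data only.
import numpy as np

def parity_report(y, y_hat, a):
    report = {}
    for g in (0, 1):
        m = (a == g)
        tp = np.sum((y_hat == 1) & (y == 1) & m)
        fp = np.sum((y_hat == 1) & (y == 0) & m)
        fn = np.sum((y_hat == 0) & (y == 1) & m)
        tn = np.sum((y_hat == 0) & (y == 0) & m)
        report[g] = {
            "positive_rate": (tp + fp) / m.sum(),         # equal fraction classified positive
            "false_positive_rate": fp / max(fp + tn, 1),  # equal FPR across groups
            "false_negative_rate": fn / max(fn + tp, 1),  # equal FNR across groups
        }
    return report

# Which (if any) of these parities "should" hold depends on what the decision
# actually is and what its real-world impact is, which is the point being made here.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
a = rng.integers(0, 2, 1000)
y_hat = rng.integers(0, 2, 1000)
print(parity_report(y, y_hat, a))
```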
In another case, you'd be concerned about false positives, with someone who wouldn't commit a crime being flagged. But lost in all that conversation also is whether or not something is justice-promoting, whatever your normative positions are. And you can fix any set of normative concerns that you say define your morality. I would argue that even anywhere within the normal spectrum, there's still a problem that these descriptions of the problems aren't sufficiently rich to say what you should do. Because the facts that are omitted are, what actually is the problem I'm addressing? If there's disparities in the distributions initially, what caused that to be? If I'm making a decision, what actually is the impact of the decision? What is the impact? How has it actually helped or hurt people if I change this decision making process? So an example might be, let's say you have a process that is determining, like, admissions to higher education. In this case, intervening in a way that creates more parity in the decisions, or creates more demographic diversity in the ultimate decision, I'd argue is a good thing. Now that's my normative position. Maybe someone who's not as progressive disagrees, but we can disagree about that. Even fixing my set of positions, if you change the situation and you say the issue is something like you're certifying surgeons or something, does subjecting someone to, say, a different standard across demographics actually help or hurt their careers? In this case, that might be a bad thing, because if you were to alter the decision making process that grants, say, a safety certification, then maybe the reality, like the real world impact, would be to almost legitimize discrimination further down the pipeline, where now patients are going to treat doctors differently because they know they were subjected to different tests. So there's these different decisions that have different kinds of... Because of what actually is the decision you're making and what actually is the impact of the decision, something that looks, from a technical perspective, like an identical problem could actually have a very different interpretation in terms of what is the justice promoting policy to adopt. So the concern is that by abstracting away from all those relevant details, you kind of lose sight of this. What we ended up finding, and this is really [inaudible 00:47:41] gets credit for this. And I think a big contribution of this paper is really just making this nice connection across like a very wide interdisciplinary boundary, is that this is almost exactly in some ways a recapitulation of a lot of arguments that have been had for decades in the moral and political philosophy literature. There you have two approaches to theorizing. You have many approaches, but just like one of the axes of differentiation in how to theorize about these questions of justice is the ideal versus the non-ideal approach. The ideal approach says, let's just imagine a perfect world and just say that things that hold in the perfect world, we should just fixate on one of them and try to make our world look more like that. It's saying... So you can think of the reason why this can go wrong. For example, this kind of theorizing has been used to oppose policies like affirmative action in a blanket way, where you just say, ""Well, in the ideal world, we'd all be colorblind. So therefore we don't need affirmative action."" That's unjust.
The non-ideal approach is in some ways a more pragmatic way of looking at these sorts of problems. So among other things missing from the ideal approach is, you only say how someone should behave in a world that is already in some ideally just or fair state, where everyone else is completely compliant with what justice demands of them, and your job is to not fuck it up. That's very different from the non-ideal approach, where you're saying, ""Hey, I live in this world, there are existing disparities. Now, given that I live in this world, given that there are these disparities, given that there are all these people who are bad actors out there, what is the justice promoting act?"" And to recognize that that's not necessarily the same thing. Then you have to be concerned with, well, what are the disparities? Who has the right or the power or the legitimate mandate to intervene on what kinds of decisions to try to rectify them? And then what are the policies that are actually effective? So I guess these questions become... If you remove those details, these questions become vacuous. I'll give you an example, which would be higher education admissions. So if you just say like, ""Okay, well we want to have the same fraction admitted among men and women,"" I think most of the people saying that aren't actually paying attention to facts. This is among what population, right? So if you were to look at like a typical school, there's already a huge gender disparity in the applications. So if you just accept people at the same rate... If you take and fix any one problem and you really start going deep, you see that there's all these other details. What is the right thing to do, what actually counts as the fair decision making process, hinges really precariously on a bunch of facts that are not represented in the problem descriptions. So I think that's our angle in this kind of critique, is to cast a light on that. There's this common saying in the Fair ML world, like, ""Oh, we have 72 definitions of fairness or something like this. Look how many definitions."" And the kind of maybe TLDR is like, we don't have 72 formal definitions of fairness. We have zero. The reason why is because you have 72 different fairness inspired parity assertions, but the real actual question of fairness is the question of what are the material facts that you need to make a determination about, and which apply in a given situation. When you look at the different topics in machine learning, is there one that you feel like people spend way too little time on, like one that you just think has way more impact proportionate to the amount of attention that people give it? My only reluctance is that there are things that are... At least the trajectory is on the right track, like people are paying more attention. But I think in general, coming up with coherent ways of addressing problems that are beyond the IID setting is really key. I would subsume under this both addressing causality and mechanism, and also robustness under distribution shift. You have like one very narrow subset of distribution shift problems, which is the minimax-like adversary setting, where you basically have the same underlying distribution, but it's like the samples are composed with some asshole who's able to like manipulate your data within the L-infinity ball. So you've got like four million people working on that problem.
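For readers unfamiliar with that setting, here is a generic sketch of what an L-infinity-bounded perturbation looks like for a simple logistic model (a fast-gradient-sign step); it is a textbook illustration under those assumptions, not the specific adversarial setup anyone in this conversation works on.

```python
# Generic sketch of the "manipulate your data within the L-infinity ball" setting:
# move every input feature by at most epsilon in the direction that increases the loss.
import numpy as np

def fgsm_linear(x, y, w, b, epsilon=0.1):
    """Fast-gradient-sign perturbation of input x for a logistic regression model.

    x: feature vector, y: true label in {0, 1}, (w, b): model weights and bias.
    The returned point differs from x by at most epsilon in every coordinate,
    i.e. it stays inside the L-infinity ball of radius epsilon around x.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # model's predicted probability
    grad_x = (p - y) * w                    # gradient of the log-loss w.r.t. x
    return x + epsilon * np.sign(grad_x)

# Toy usage: a correctly classified point near the boundary can be pushed across it
# even though no single feature moves by more than epsilon.
w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.1, 0.0]), 1
print(fgsm_linear(x, y, w, b, epsilon=0.2))
```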
But in the broader set of, what are the kinds of structural assumptions that allow us to generalize under distribution shift? I think we have maybe... This is a problem that plagues every single real world ML setting, and even among papers that say they're working on this problem, I think the vast majority of people don't seem to even understand the basic concepts. For this technology to actually be usable, I think we need to have coherent principles under which we can make confident assessments about when it's going to be reliable and when it's not. I mean, that's obviously a little bit biased maybe towards my research agenda, but I think that's- No, that's fine. That's why we ask the question. It really is. I mean, I guess that's sort of like the common sense wisdom for how you should choose a problem: you should pick something that you think is important and underappreciated, and not overappreciated. Yeah. Fair enough. I think actually you should feel happy about that situation. I think somehow, not logically, people get stuck working on problems they don't think are the most important problem maybe, or at least based on some of the conversations we've had. Part of that is people being lazy. Part of that is the friction, right? If you had a thing that you thought was important once, and then you built your lab around it and you got funding on it and your whole life revolves around maintaining this research. I guess now that I'm running a big lab and now that I have finances to worry about and all that, I'm a little bit more appreciative of the handful of people out there who really did make these hard left turns at some point. I think Michael Jordan's a nice example of that. You could say he's like Miles Davis or something, but it's like, okay, each decade he had neural networks, then Bayesian nonparametrics. Now, I guess it's like mechanisms and markets or whatever he's working on. Well, you've made quite a leap from music to deep learning, I think. Yeah. I think it's time for me to retire. I think five, six years, that's the left turn. That's the left turn limit. The final question is... I don't know [crosstalk 00:54:42]- I have a mortgage now though, so it's a little bit harder. But you have a fancy computer behind you there. I don't know what. That's actually my Power Mac from '95, '97, maybe. Does it work? Oh yeah. Did you have one? Something like that. Yeah. With like HyperCard and yeah. Yeah, I think there's still like Oregon Trail and like Diamond, all those like weird freeware games, like MacSki. Oh yeah. MacSki. Epic. Final question. When you look at taking machine learning models from the ideation or the research paper to actually deploying them in production, where do you see the biggest bottlenecks or things falling apart the most? I think the biggest bottleneck is still problem formulation. I think if we were to be really sober about most of the things that people thought they were going to do, and then you look at the way they pose the problem, and then the data they could actually collect, and if they could produce it, does this in any way actually address the problem that they thought it was going to? I think it would be... I don't know how you would collect that statistic and there's some like measurement questions, but I think it would be like really depressing. It'd be really sobering that I think most things people think they're going to do are either kind of goofy and who knows if they work, or just like not relevant, will never get used.
I think that that's it: figuring out where there's really a well motivated application of machine learning and what it is. There's like that weird nexus, the kind of pieces of information that you're asking people to put together. I think this is why, not to be like data scientists are great or whatever, but like why I think people who are really good at this job are really hard to find in some way. And at the same time it's kind of puzzling, right? Because I don't think that the great data scientists are in general great or even rate-able mathematicians. I think for the most part, the people actually touching data are mostly lousy mathematicians. They're usually not world-class engineers. I certainly am not. What is it? I think it's this weird combination where the weakest link kills you. You have to see, to be good at doing this applied work, what is the important problem. You have to also know, what is the current technology, is it in the ballpark of being able to get you there on this kind of problem? How does it match against the data that's available? Then I think you have to, at least at an intuitive level, do this non-statistical thinking of, what's actually the process where you're deploying it? The x-rays or whatever it was we were talking about, the retinopathy imaging or something. This is sort of a good application of machine learning because what those images look like isn't changing over time. But you look at all these places in industry, people trying to build recommender systems, do all these things where it's basically... It's like totally incoherent. Nobody has any idea what happens when you actually deploy it, because you're modeling something that's only like in the vaguest or weakest of ways actually related to what you think you would like to be predicting. You're predicting clicks or whatever. You're predicting conditional on the previous set of exposures. Almost never with any kind of coherent accounting for what happens when you deploy the model. I think this obstacle is people making that... I think this is always in some ways the hardest part of intellectual work in this area, is the binding. First level difficulty is like technical skills. Are you a good programmer? Are you a good engineer? Do you write proofs that are correct? You do whatever. But I think the conceptual difficulty in working on machine learning is, do you make the connection between this abstraction that you possess and the world that you're actually trying to somehow interact with? That to me, I think often is where all kinds of things go off the rails. I think where a lot of even good academic work goes off the rails. It's like you can go down some rabbit holes asking like really second, third order theoretical questions about these fairness things without ever asking, does this actually map onto any real world decision that I would want to make? Does it actually help someone with a problem that I purport to be motivated by? I would just say that... I mean, I don't know if that's kind of a banal answer, is like- No, no. It's great. Asking questions the right way or something, but... Sure. Well, thank you so much, Zack. Thanks for taking the extra time. That was super fun. Yeah, for sure. Thanks for having me.",10745
+Anthony Goldbloom — How to Win Kaggle Competitions,https://www.youtube.com/watch?v=0ZJQ2Vsgwf0,2657,2020-09-09,"If you think of professional athletes, they do a lot of training. Kaggle's communities are over five million people now. The people at the top are...
it's just gone from more of an amateur sport to more of a professional sport, right? I think the difference isn't that the results would be better, but the top performers now would get to those results faster. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Anthony Goldbloom is the co-founder and CEO of Kaggle. Kaggle started as a competitions platform and is probably the largest community of data scientists and machine learning practitioners on the planet. They also now do Datasets and Kernels and we're going to talk a lot about that. I'm super excited to talk to him today. I remember you were talking about deep learning before I'd really heard about it. You were kinda the first person I knew that was really thinking about it, and I think it's because you saw that people were winning Kaggle competitions with new methods that were less mainstream at the time. And I'm kind of wondering if on Kaggle, you're seeing people doing things that you don't think are in the academic mainstream, or if you're seeing things that point to what you think could be mainstream in production in the next few years. It's a good question. I mean, to be honest, the most glaring thing that we see on Kaggle that is not fashionable in academia is that we're still seeing gradient boosting machines do very well on a lot of structured data problems. And there's not a lot of research attention on things like gradient boosting machines now. It begs the question, like, have we done everything we can there, or is it an area where there is still more that can be done but it's just not the trendy thing? It's hard to get papers published and so it's just not getting the attention. To be honest, that's the most glaring difference we see between what is doing well on Kaggle and what is fashionable in the academic literature. Well, it's really interesting. We are seeing some novel uses of things like BERT and WaveNet being used on forecasting problems. We've seen BERT do really well on chemical informatics type problems, or sorry, problems that have to do with gene sequences and things like that. So we're seeing use cases, perhaps unknown use cases, for some well-established technologies, but I think just the lack of academic focus on these gradient boosting machine algorithms is probably the biggest glaring distinction that we see. And so in a structured data competition where gradient boosting wins, what are the details that the winners do to win those competitions? Is it still feature engineering or is there other stuff? Exactly. And it's finding clever features that other people aren't finding. Perhaps that's it, that's the reason, or maybe that's where you could have more academic focus. There is no doubt in my mind there are things you could do to automate feature engineering to some extent, and maybe there are some things that companies like DataRobot and H2O are doing when they are baking in these recipes. So as an example, you see a date or a time stamp, for instance. That is an incredibly rich field that can become like 70 things, right? Let's say you're doing traffic forecasting, a timestamp can be turned into: Is it rush hour? Is it not rush hour? Is it a weekend? Is it not a weekend? Is it summer? Is it winter? Therefore, does it change the probability of rain on a given day, or adverse weather like snow on the road, and things like that?
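A rough sketch of that timestamp expansion, with the derived features fed into a gradient boosting model; the traffic-forecasting file and column names here are hypothetical, just to mirror the example.

```python
# Hypothetical sketch: turn a raw timestamp into the kinds of derived features
# described above (rush hour, weekend, season), then fit a gradient boosting model.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def timestamp_features(df, col="timestamp"):
    ts = pd.to_datetime(df[col])
    return pd.DataFrame({
        "hour": ts.dt.hour,
        "is_rush_hour": ts.dt.hour.isin([7, 8, 9, 16, 17, 18]).astype(int),
        "is_weekend": (ts.dt.dayofweek >= 5).astype(int),
        "month": ts.dt.month,
        "is_winter": ts.dt.month.isin([12, 1, 2]).astype(int),
    })

# 'traffic.csv', its 'timestamp' column and the 'traffic_volume' target are made up.
data = pd.read_csv("traffic.csv")
X = timestamp_features(data)
y = data["traffic_volume"]
model = GradientBoostingRegressor().fit(X, y)
print(dict(zip(X.columns, model.feature_importances_)))
```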
There are definitely things that could be done to automate components or help with some of the heavy lifting in feature engineering. And so, that could be an area of focus, or perhaps maybe it's not an academic area, but it's the kinds of things that H2O and DataRobot and companies like that can build into their products. Well, that's so interesting. So if you roll back ten years, or you took a competition from the start of Kaggle and then you ran it today, how much better do people do? Do they even do better? Do you think you could win? Could you take modern tools and win every competition at the start of Kaggle? I guess my question is, has the feature engineering really improved? So when Kaggle first got started, people used to use all sorts of things like support vector machines, self-organizing maps, really a large range of things. The first big development that we saw, or the first big contribution I think Kaggle made, is we made it very clear that Random Forest was the best algorithm for actually most problems at that time, and then, let's say about 2014, Tianqi Chen at the University of Washington released XGBoost. I think it was always thought that gradient boosting machines should be better, you know. It's a very smart approach to boosting weak decision trees. They should be better. They were very, very finicky before XGBoost. When Tianqi Chen launched XGBoost, it really took over from Random Forest. I'd say, unlike the difference between deep neural networks on computer vision problems versus Random Forest, the XGBoost increase is not a huge increase, but it was enough. And so, to be honest, on unstructured data problems, I think that's probably where most of the software driven improvements have come from. It's probably the case if you took a problem from the early days of Kaggle that was won with Random Forest and you reran it today, I think you'd get a little bit of a better answer because of XGBoost. I don't actually think the way people are doing feature engineering has really improved very much. However, I do think that you would get... What we always say is that in Kaggle competitions, typically, you know, the top teams all converge on about the same level of accuracy, and the intuition there is there's only so much signal in a dataset, right? And so people compete to the point where they've extracted all the signal. So I believe in the early Kaggle competitions, people always found the key features and they got all the signal out of the dataset. I think what might happen is it would happen a fair bit faster now. If you think of professional athletes, they do a lot of training. Kaggle's communities are over five million people now. The people at the top are... It's just like gone from more of an amateur sport to more of a professional sport, right? And so I think the difference isn't that the results would be better, but the top performers now would get to those results faster, interestingly, is my guess. Definitely. We ran a challenge with Pete Warden, who's now at Google. He was running a company called Jetpac, and I think it was a competition to distinguish between cats and dogs, and I think we did that, if I remember correctly, we did that before deep neural networks and afterwards, and obviously saw a fairly big lift. We ran a challenge with the Allen Institute for Artificial Intelligence on solving an eighth grade science quiz. This was before the BERT innovations. People were getting about 60% accuracy using information retrieval methods.
Allen have now run BERT-based solutions on it. They're getting about 90%. So you definitely see... Before deep learning and after deep learning, you see very large changes in results for sure. And I would think that on some unstructured data, language models and things would really make a big difference, right? Yeah. I mean, so you definitely have structured data where you have fields that are text fields, et cetera, et cetera. And so maybe you use language models to create features that ultimately go into... I mean, that's a common strategy. We run multimodal competitions sometimes where you have images, and someone will run a convolutional neural network in order to come out with features out of that image, that then get fed into a gradient boosting machine classifier. So that definitely happens. And so, just as it gets done for images that might be part of a multimodal dataset, of course, it can happen as well when there are columns that are text, or data sources for a challenge that are text. Are there interesting data augmentation strategies that you see people using? I feel like often people talk about that as a major area of innovation, and so do you see that on Kaggle? There's a couple of ways people win Kaggle competitions. In the world of structured data problems, it's clever feature engineering. It's very often that, let's say for natural language processing or for computer vision problems, it's clever data augmentation that wins competitions. And one of my favorite examples... the Kaggle community is really creative. I remember, I think it was with Quora, we did a challenge around detecting insincere questions. I think that was the challenge, and the winning strategy there, or the thing that the winners did that others didn't do, is they would translate the question from English to some other language and then translate it back, because if you use Google Translate, it's not a symmetrical translation, right? And so it was like a clever way to augment that dataset. So there are the standard techniques, you know, rotating, et cetera, for images. But then there are clever, creative tricks like that translation loop. There have also been a bunch of, you know, one of the libraries that has really taken off on Kaggle, I think it was written by a Kaggle master. It's called Albumentations, which makes it much easier to do sophisticated data augmentation, particularly on... that's designed for images, so it's definitely an area of a lot of focus in our community. That's really interesting. Do you think that with feature engineering and augmentation being so important, and then also like compute resources increasing, has overfitting in the training process become more of a problem over time, and have people come up with new ways to address that? Like, I would think that if I'm just spinning through millions of possible feature combinations, I would end up overfitting on the validation data, right? Earlier I said there are two things required to win Kaggle competitions: tenacity and creativity. I actually think there are three, and the third is being statistically wise. So one pattern we see all the time with people who are competing in their first Kaggle competition is they'll submit, they'll be top of the public leaderboard, and then what we do is, at the end of a competition, we rescore everybody's algorithms on a second test dataset that they've never seen before. And also, let's say you put in, you submitted 150 models, you have to select two that get rescored, right?
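Going back to the back-translation trick described a moment ago, here is a minimal sketch of the idea; the translate() helper below is a placeholder for whatever translation service or library you have access to, not a specific real API.

```python
# Sketch of back-translation augmentation: round-trip each question through another
# language to get a paraphrase with the same label. `translate` is a placeholder,
# not a real API; plug in whatever translation service you actually use.
def translate(text: str, src: str, dest: str) -> str:
    """Placeholder for a call to a translation service or model."""
    raise NotImplementedError

def back_translate(text: str, pivot: str = "fr") -> str:
    """English -> pivot language -> English; because the round trip is not
    symmetric, the result is usually a slightly different phrasing."""
    return translate(translate(text, src="en", dest=pivot), src=pivot, dest="en")

def augment(examples):
    """examples: list of (text, label) pairs; append paraphrased copies with the same label."""
    return examples + [(back_translate(text), label) for text, label in examples]
```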
And so a very, very, very common pattern is that somebody will be on top of the public leaderboard. We then rescore them on the second test dataset and then they've dropped like 90 places. And it turns out to be an amazing way to learn the lesson of overfitting because, you know, you're staying up till midnight on the leaderboards, just after midnight UTC (that's when we reveal who actually won). And so you're staying up till midnight hitting refresh, refresh, refresh, you know, am I in first, am I in first? And you look at the public leaderboard and the private leaderboard, and when we switch over, you're not in first. You're not in second. You're in eightieth position. That person who overfits in a situation like that, they never overfit again. And I'd say that if anything, that was a bigger problem. Coming back to your question, has overfitting on the validation set become more prevalent? I would actually say it has become less prevalent, because it has become well known that to do well on Kaggle, you really need to be careful of overfitting. But just one more, kind of, adjacent point I want to make here is, you'd be surprised at the credentials of some of the people who have had that happen to them, where they're coming first and then they drop to a 100th spot. It makes me wonder how many of the world's research papers are actually overfitted. You know, how many of the world's algorithms in production are overfitted, if experienced, good people who have a lot of models in production or a lot of research papers out in the world, when they come to Kaggle, a place where they cannot get away with overfitting, they still overfit. Yeah, it really does make me wonder, like what percentage of the world's algorithms are actually overfit? I mean, it sounds really tough, right? Because I imagine myself in a situation where I'm trying to be statistically rigorous. But I think when you're testing lots of features, you know, I feel like it's tough to be perfectly statistically rigorous in that process. Do you have a sense of best practices that you tell people, or where do people learn so that they don't overfit at all? I think it's cross-validation. I think cross-validation works well. There are also techniques... there's a Google researcher who has a kind of clever approach if you have a very small dataset where it's difficult to cross-validate: you don't get told that your algorithm outperformed unless it outperformed by a statistically significant margin, you see the intuition. And so you either get no information or you outperform by a meaningful amount... That actually sounds pretty good, although you might miss stuff, I suppose, but yeah, that's an interesting idea. Yeah. I mean, I think cross-validation is still the right approach if you have a large enough dataset, and if you don't, you know, this technique is at least the one I'm aware of that allows you to still protect against overfitting. Interesting. Also, one thing I really wanted to ask you is, you know, I think there's almost like this trope in ML circles that's like, you know, the real world isn't a Kaggle competition. I'm sure people ask you about this all the time. But I was curious what you think about what the differences are. Like, do Kaggle grandmasters tend to do well in the real world? Like, what parts of Kaggle translate to actual applications, what parts don't? So I guess there are... Let's break machine learning, or building and productionizing a successful machine learning model, into three stages.
Firstly, you've got to turn your business problem into a machine learning problem. The second is you've got to train a classifier, check that it's robust, et cetera, et cetera, and that it really works. And then the third is you have to productionize that classifier. Kaggle is obviously phenomenal for number two. I really think if you do well on Kaggle, if you train through Kaggle, you become as good as anybody in the world at number two. And I just want to maybe extend that a little bit more. Wonderful story from a very elite Kaggler, who's a senior engineer and leads an autonomous driving unit. He started there as a very junior engineer, and he made a name for himself among a bunch of, you know, well credentialed people on that team. PhDs from famous machine learning universities had been working on a problem for three months. He took it home over the weekend and got way further over the course of the weekend than they had gotten over a three month period. And that's because Kaggle challenges you to see lots of different datasets and you're working to a deadline. And so he was able to look at that problem. He had seen enough problems directionally similar, had a good intuition for what the right approach was going to be, and so he was able to come up with a very, very, very good solution very quickly. He's now a very senior engineer on that team, and that's how he made his name. And so I think Kaggle is outstanding for number two. People say we're irrelevant for number one, which is turning a business problem into a machine learning problem. I actually don't agree with that. I think Kaggle trains that muscle, obviously less directly than the actual training of models. But the way it trains that muscle is, you see, the business problem is described in the text of the challenge, and then you see how that team set it up. And so you see lots and lots and lots of examples of how different teams have taken a business problem and set it up as a machine learning challenge. The area where I think Kaggle gets dinged, and fairly so, is on number three - taking a prototype model and productionizing it. This is the kind of thing where really you have to be inside a company that's productionizing models in order to get experience. But I don't think today that people are missing out on a lot. The process of having a model that has been prototyped in a Jupyter Notebook and then productionized is an incredibly painful thing. It's not like the existing practices are wonderful. And by the way, I think this will change over the next few years, but I don't think that somebody trained on Kaggle is going to go into a company and think, ""oh, there's this whole world of things that are sort of well-established practices that I don't know"". I think number three is currently a mess and it's a painful thing in just about any company you go and work for. So I just don't think that you're missing out on too much. Yeah, it's going to be painful whether you've come up through engineering at Google or you've come up through Kaggle. So I'm going to channel some of our audience. I know just from the comments that we get and the questions and the selection, I think we have a lot of folks that are trying to break into Silicon Valley type jobs, and I think Kaggle is a good way to do that. Do you have any advice for someone that's trying to get into machine learning through Kaggle? Like, have you seen that work for people, and how is that process done? Absolutely.
Kaggle is now, at this point, a pretty well-recognized credential, and in my view, a faster, more accessible way to break in than, you know, a PhD from Stanford or the University of Toronto. In your unbiased view [laughs]. In my completely unbiased view. I mean, we see people... anyone who's a grandmaster on Kaggle... and I think if you work at it, you can get a PhD in five years. If you really work at it, you can go from, you know, an engineer who's OK with math to a grandmaster in a year. And it doesn't cost you whatever student debt, et cetera. But how do you actually do that? Like, is it just doing competitions or... It's doing competitions. It's reading the forums. It's... we have Kaggle Notebooks, where you can start with other people's code. You focus, you ask a lot of questions in the forums, and you can become a grandmaster across one of four dimensions. You can be a Competition Grandmaster, a Datasets Grandmaster, a Discussion Grandmaster or a Notebooks Grandmaster. In rank order, the most respected (probably by employers) is Competitions, in my view, and justifiably so. We prefer people who are either Notebooks or Discussion grandmasters, and the reason is you have to be insightful enough that you're writing comments that people upvote or you're writing notebooks that people upvote, and you have to write clear enough code or communicate clearly enough that you would also make for, you know, the kind of person you want on a team. I would pick one of those three categories, and the kinds of places... Nvidia has hired somewhere in the order of, 10 to 20, I'm not sure the exact number, grandmasters over the last, let's say, six months. So they've been on a hiring tear for Kaggle grandmasters. DeepMind has hired quite a few, H2O have a team of about 15 or so Kaggle grandmasters. So it's not a credential that every company goes for, but a meaningful number of top AI companies have valued this credential. And then once you've got your first job, you're in, right? You'll continue to get other good jobs. I feel compelled to add that I also love to see Kaggle credentials on a resume here at Weights and Biases, a top ML company. But I'll say for me, I even appreciate, you know, we hire a lot of engineers, back end and front end, and sometimes they'd be like, ""hey, I did a couple of Kaggle competitions"". And I think that just says so many good things about a prospective employee, I'd love to see it. So I just totally agree with you. I guess I'm maybe slightly less obviously biased, but I love Kaggle credentials and I love it when people share them on their resume. One thing I want to make sure I had time to get to, just because you've told me privately, I think, that in some ways the publicly lesser known Kaggle products have in some ways more traction than the competitions product, right? Can you say what the other ones are and why you decided to build them and how they're doing? Yeah, sure. So Kaggle, between 2010 and 2015, we were all machine learning competitions. We always thought that machine learning competitions are very powerful, but also not going to appeal to everybody. And as data science and machine learning grow as a profession, we wanted to make sure that we were continuing to grow with it. The first thing we launched is Kaggle Notebooks, which is basically a hosted Jupyter Notebook. And really, it doesn't cost you anything. You get a CPU, a GPU or a TPU, so you get as much CPU as you want, and then you get 30 hours a week of either GPU or TPU.
So you get quite a lot of free accelerators. It's so great that you do that. It's awesome. You know, we're part of Google. We would not have been able to do this as a standalone company. But the really nice thing here is you come to any notebook on Kaggle, you can hit the copy and edit button, we used to call it Fork, and you get somebody else's code running in an environment that will run it, right? So it's completely reproducible. We launched it initially inside competitions because we noticed people sharing their code in forum posts, either linking to a GitHub repo or attaching a Python script. We also noticed that those forum posts... People came to Kaggle to learn, so those should have been the most valuable forum posts, but they got dramatically less interest than forum posts where people shared ideas. So it's like the most valuable content was not really getting utilized, and that's because, you know, you get someone's Python script, you have to get the same version of Python, the same version of... are you on the right side of TensorFlow 1.0 versus 2.0, like all these dependencies. It really takes some number of hours to get somebody else's code to run, and then you don't know if it's any good. And so if we could create an environment where it was much easier for people to share code and not have to worry about the environment behind it, that was the insight behind Kaggle Notebooks. We have somewhere in the order of 800,000 users every month... I'm on parental leave so my knowledge of the data might be a little out of date. It might be over a million people looking at other people's notebooks, which is just extraordinary. The other product we have, kind of like how people share videos on YouTube, we allow people to share datasets on Kaggle, and where this grew out of is we noticed, when we launched Kaggle Notebooks, they were initially only useful for people competing in competitions to share code alongside competitions. But we noticed people using Kaggle Notebooks to share more free-form insights, and so it's like, oh, wouldn't it be great if we gave people datasets as well, that they could do more free-form type work on? And so we launched this platform where anyone can share any dataset. And we have, at least last time I looked at the metrics, which again was before parental leave, it was over 400,000 people downloading Kaggle Datasets. So I think these are... remember, data science and machine learning is still an emerging profession, and to have those sorts of numbers on these products means that they're getting really meaningful traction. Yeah, it's really cool. And they're also a nice way in. Like, a Kaggle competition is a tremendous commitment. If you want to get involved in Kaggle but don't have the time to put into a competition, these are ways to get involved in a more lightweight way. Do you have a favorite kernel or dataset that maybe doesn't get as much attention as it deserves? There's lots. There are all sorts of things. Somebody uploaded a dataset taking x-rays, and you had to predict the age of the person based on their x-rays. It was called the Bone Age dataset, which I thought was kind of cool. That is really cool. Just like really random. Aww man, what else? We've got a lot of cool covid-19 datasets. One of the things that has been nice with the covid-19 datasets that have been shared is that, you know, you've got the Johns Hopkins University dataset, for instance, and everyone's looking at that.
One thing that people have been doing on Kaggle which is really powerful, is they've been creating these very rich panels where they join the Johns Hopkins dataset to daily data from the nearest weather station. And so there's this debate about how does temperature and seasonality impact the transmission of covid-19? Well, somebody has just... And I've seen some of these studies, like some study will look at 100 cities in China and draw some conclusion about the rate of transmission and how temperature impacts it. Well, people in the Kaggle community just hand delivered well over 4000 locations joined with the nearest weather station. And so those kinds of things are also, I think, really cool and really powerful. That is so cool. And it's been out for four or five years, I guess. Kaggle Notebooks launched in 2015, that was about May, and Datasets... And by the way, Kaggle Notebooks really started as an edit box with a run button. We also launched very lightweight. We launched with R, not Python. Oh, wow. That's a sign of the times. Wow. That is amazing. The number one feature request was can you add Python? Although I should say some of the most beautiful Notebooks are still written in R, particularly for data analysis, data visualization. It does really well on Kaggle, beautiful content still created with it, and then Datasets launched in August of 2016. It's funny. I really want to ask, these are just the questions I selfishly want to ask, but I guess it's my show so I can do it. When you look at the tools people are using, I mean just as an example, do people use anything besides Python and R, and do people actually, like, win competitions with R? I say this as a long time R user who switched to Python, so I'm not against R, but it does seem like almost everyone has switched at this point. Yeah, I mean, when Kaggle first started, it was like MATLAB, we even saw SAS, we saw a whole lot of stuff, and then it quickly became R, and then Python's rise has just been sort of fairly steady, it's just like steadily taking share. I'm not actually sure what the numbers are. Now, I know a couple of years ago it was two-thirds Python, one-third R. I suspect it's probably closer to 90% Python, 10% R now. I'm actually not certain of that, though. I'm not aware really of R doing particularly well in Competitions. I think Python is hard to beat, like the support for neural networks, et cetera, et cetera, is, I think, a fair bit stronger in Python. And so I'd say the place where R has really shone in the community is beautiful, beautiful Notebooks. You know, a new competition launches, somebody writes a Notebook. We have a user I have in mind when I think of, you know, I have a persona in mind, Heads or Tails is a Kaggler who writes beautiful R Notebooks, sort of helping anyone competing in the competition get their arms around what's in that dataset. So it still plays a nice role. I'd say that we've seen a bit... To the extent that people are using different types of technologies on Kaggle, we had a challenge fairly recently where H2O's Driverless AI did really well. It was a forecasting challenge. We've had challenges in the past where Google's AutoML has done well. I happen to be a huge believer in that category. You know, so much of tuning a neural network is turning knobs, why shouldn't that be automated? Even so much of feature engineering, and we spoke about this earlier on, exploiting a date can probably be written into software.
So I'm a huge believer that the next big thing in terms of model training is probably more and more automation of the things that are relatively easier to automate. And then the other place that we've seen new technologies get adopted, like some of the things that are generating excitement in the community: we launched TPU Notebooks, and there's certainly some class of problems where TPUs are doing extremely well, particularly on the natural language processing models, which are very, very, very memory heavy. I think there's a lot of excitement, and Nvidia is coming out with a new chip soon, which I think there's a lot of excitement around. So any advances in accelerators get people excited. Is this faster training times, or literally the models perform better? So one of the reasons that faster training times matter is because we give people nine hours in Kaggle Notebooks. So it's faster training times, but a lot of people are using our Kaggle Notebooks, and so the fact that you can train a model faster matters. There also are some cases, I think, where people have trained things like RoBERTa, which is a particularly memory intensive version of BERT, and so the TPUs we make available have enough memory such that, you know, it's the difference between being able to train a RoBERTa model or not. What about frameworks? How do they break down? Is everyone using TensorFlow or what's...? No, I think PyTorch is really... We just see PyTorch is doing really well. Actually I don't know what the... TensorFlow 2.0 is obviously a big release. People love Keras. It's really Keras that has dominated on Kaggle for a long time, and now PyTorch is... I'd say maybe Keras and PyTorch, at least last time I checked, were sort of roughly equivalent, and the trajectory on PyTorch was very strong. I don't know that I have a good sense of the extent to which TensorFlow 2.0 has, you know, changed the PyTorch versus TensorFlow trajectory, you know, what that looks like yet, but certainly PyTorch's rise has been very, very, really incredible. And what about the simpler layers on top of PyTorch, like Lightning and fast.ai and Ignite, like do you see those? Yeah. I mean fast.ai, definitely we see, I think, fast.ai. Quite a lot of people do that course, and then Jeremy mentions Kaggle a fair bit in that course. And so I think that ends up being a decent feeder to Kaggle, and so people coming out of that course bring fast.ai; the other two, not so much. Is there like... Do grandmasters have a different set of tools? Like, do you see the more experienced people switching to stuff or...? I think what they often have is not so much their own tools, but they very often have built, what's the way to put it, their own little framework, so that when they see a computer vision problem, they're not starting from scratch, but they're starting with code that they know well, that they know how to optimize well. And by the way, the fact that so many grandmasters are creating their own little frameworks suggests that the PyTorches and the TensorFlows aren't, you know, there's something missing, or maybe it's that people like to do things their own way, or maybe they're like slightly too low level for the kinds of things that people are doing in the Kaggle community. So what are the common things that these frameworks do? I think they're very often like all the pipeline steps, right?
So you've got your new image dataset coming in, and the first thing you want to do is data augmentation, and they have their preferred way of implementing data augmentation and then passing it on to the next step. So I think it's really focused on creating a pipeline that starts with a new dataset and ends in a Kaggle submission file. Okay, I want to totally shift gears because there's a couple more questions I want to make sure I get to in this time. So this is actually a Lavanya suggestion for a question, and I love it. So I remember you giving a TED talk about the jobs that we'll lose to machines. I remember you telling me about it. I thought it was super smart and you had a way of thinking about the jobs that you should be worried about losing. So I was hoping you could describe that to this audience who maybe haven't seen the TED talk, but I also was wondering if your thinking has changed at all since you gave the talk, because that was in 2016 and now things have changed a bit. Yeah. At the time maybe it was slightly not obvious to me. It reads now like it's very obvious. The basic conclusion is jobs are at risk if you're doing the same thing again and again and again. Kaggle had done a path-breaking competition, taking images of the eye and diagnosing diabetic retinopathy, and the results on that were just outstanding. And we've done a lot of other medical imaging problems. And you think about what... This is fairly hackneyed at this point, but a radiologist does the same thing again and again and again: looking at an image, making a diagnosis from that image; or an ophthalmologist. Whereas if your job requires you to be creative and to be doing different things on a daily basis, you're probably much safer. You know, some of the professions I think I mentioned in the talk, at least I had in the back of my mind, that surprisingly haven't seen much automation yet are things like auditing and vetting basic legal contracts: ""Do you really need a lawyer to look at yet another relatively boilerplate NDA?"" ""Do you really need an auditor to go over the majority of a company's accounts, which are probably fairly standard?"" So there's still a lot of work that, in the four years since the talk, I'm not aware of a ton of work being done to automate. So it sort of points to perhaps opportunities, or perhaps things that still haven't been done. But I think the conclusion still stands. You know, in 2016 I was fairly skeptical that we were going to make a lot of progress towards Artificial General Intelligence. I still am. I just don't see it. The current set of tools that we have, I think, are incredibly useful and they are very powerful and they allow us to do a lot of really cool things. I don't see a path from where we are, a smooth path that doesn't involve multiple step changes from where we are today, to AGI. And the distinction between repetitive tasks and tasks that require more creativity and more moving around still seems true to me as to where jobs are at risk and where they aren't. Do you have any opinions on reinforcement learning? I feel like we've seen a lot of interesting stuff coming out that certainly feels like a different kind of intelligence to me. Have you looked at that? Yeah, I guess with reinforcement learning, where Kaggle aims to be is somewhere between where the cutting edge of academia is and where the average Fortune 500 company is.
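As an aside, here is a minimal, purely illustrative sketch of the "dataset in, submission file out" pipeline Anthony describes grandmasters building for themselves a little earlier in this exchange. Every piece of it (the toy data, the flip-based augmentation, the majority-class "model") is a hypothetical placeholder rather than any particular competitor's framework; the point is only the shape: load, augment, train, predict, write a submission file.

```python
import csv
import random

def augment(image):
    # Stand-in for data augmentation, e.g. a random horizontal flip of a 1-D "image"
    return image[::-1] if random.random() < 0.5 else image

def train(train_items):
    # Placeholder "model": just remember the most common training label
    labels = [label for _, label in train_items]
    return max(set(labels), key=labels.count)

def predict(model, item):
    # The placeholder model predicts the same label for everything
    return model

def write_submission(ids, preds, path="submission.csv"):
    # Final pipeline step: a Kaggle-style id,label submission file
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "label"])
        writer.writerows(zip(ids, preds))

# Toy "dataset": (image, label) pairs plus unlabeled test ids
train_items = [([0, 1, 2], "cat"), ([3, 4, 5], "dog"), ([6, 7, 8], "cat")]
test_ids = ["img_1", "img_2"]

model = train([(augment(x), y) for x, y in train_items])
write_submission(test_ids, [predict(model, i) for i in test_ids])
```

A real version swaps a deep learning framework into the train and predict steps, but the value is the same: a new competition only requires changing the data loading and the model, not rebuilding the plumbing.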
So I guess we believe in reinforcement learning enough that we've actually started; I guess two or three weeks ago, we launched our second ever simulation competition. It was with Two Sigma, where people are writing AI bots to beat each other in a game called Halite. In January, we tested the concept with a game called ConnectX, and so we are now investing in reinforcement learning. You know, I have heard of some cool, pragmatic applications as well: DeepMind optimizing Google's data centers. I've heard about potential uses in search and ad targeting and in stock market trading, so I guess I'll start to get excited when there are more. Kaggle is excited enough that we're investing and that we're making reinforcement learning challenges available to our community, but I'll be more excited than I am today when there are more pragmatic use cases that we can point to where it's really making a positive difference. That makes sense. Alright. So we always end with two questions. The penultimate question here is, what's the topic in machine learning that people don't talk about as much as they should? I mean, one of the things I'm really energized about at the moment, particularly with the success of Kaggle's Datasets platform, is I want to make it easier for people to find, access, and join external datasets to their own datasets. They're raw materials, right? So the easier it is for us to integrate them into our machine learning algorithms, the more powerful our work is going to be. And so that's one area. And then the second one is one I mentioned earlier: gradient boosting machines still do incredibly well in Kaggle challenges and it's just not an area of academic study at all. And it makes me wonder, is there more that could be done? Is this an area that's being overlooked? Is there more that could be done around gradient boosting machine-like technologies? Have we abandoned it too fast? It's funny, I worked a lot on gradient boosted trees back in the day. It's probably the algorithm that I spent the most time with, and I remember feeling like they had trouble learning linear relationships, like they're so good at, you know, these step-function things. But is it just gradient boosted trees, or do people add other weak learners to their gradient boosting? What we typically find with structured data problems is people will try XGBoost or LightGBM or some gradient boosting machine framework, and if you look at it, that is what's doing 99.9% of the work, and very often people will ensemble in other things just to get a bit more diversity. In most cases, I think that ensembling in other things really helps take you from twentieth to first. But actually, when a company looks to productionize a model that comes out of a Kaggle challenge, they will strip out all the other stuff because you just don't want the complexity. So I think gradient boosting machines, plus clever feature engineering, is typically enough on most structured data problems. Definitely. I'd classify that as interesting. The final question is, when you look at, and this is maybe outside of Kaggle, but at Google and all the companies that you've talked to, when you look at the path from inception to deployed machine learning software, where do you see the biggest bottlenecks? I like this analogy. There's this company at the moment that I think is pretty cool called Webflow.
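To ground the "gradient boosting does 99.9% of the work" point above, here is a minimal sketch of that kind of baseline on a structured data problem, using LightGBM's scikit-learn API. The data is synthetic and the hyperparameters are arbitrary; a real entry would add the clever feature engineering and the small ensemble of other models Anthony mentions.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical tabular data: 1,000 rows, 10 numeric features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] * X[:, 2] > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# A plain gradient boosting baseline; on most structured problems this,
# plus feature engineering, is the bulk of a strong entry.
model = LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train)

print("validation accuracy:", accuracy_score(y_valid, model.predict(X_valid)))
```

And per his production point, the version a company actually ships would usually be just this single boosted model, with the extra ensemble members stripped out.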
And what they do is they help with the seam between a designer and a front-end engineer by making it much easier to get a design into HTML, CSS, and JavaScript code. And I think that's probably an area: taking a prototype model written in Python, possibly in Jupyter notebooks, into a production system, where, you know, maybe the production system is in Java or something else, is really nasty. You've seen a bunch of companies invest. Google obviously has invested internally; Uber invested in a system called Michelangelo. There are a lot of systems that have been built inside companies to try and solve that problem. But a bunch of startups now are trying to take those systems that have been built internally for the likes of Google, for the likes of Uber, and make them available to the wider world. I think that's definitely a problem that urgently needs to be solved. The seam between data scientist and machine learning researcher and data engineer is a really nasty seam at the moment. Well said. Awesome. Well, thanks a lot Anthony, that was actually fun. Cool, thank you. Thanks for having me.",7018 +Suzana Ilić — Cultivating Machine Learning Communities,https://www.youtube.com/watch?v=uKjX-iJGKyA,2096,2020-09-02,"I think the most important thing is to do something that you are really interested in, because if you're starting it, a lot of things will depend on you. And the key is, I think somebody wrote it on Twitter recently; the key also to MLT is consistency. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. Suzana Ilić is a founder of MLT, Machine Learning Tokyo, which is a huge community of people working on and learning about deep learning. She has hosted around 100 machine learning related events in the last two and a half years and built an incredible community. I am super excited to talk to her today. Suzana, it's so nice to talk to you. I was really looking forward to this because I see that we share two interests in common. One seems like the democratization of AI, and another is edge computing, or deploying to hardware. So I'm super excited to hear about what you've been up to. I thought maybe we would start with Machine Learning Tokyo. I would love to hear about why you started it and what it does. Yeah. First of all, thanks so much for having me. I'm super excited. I love Weights and Biases and visiting SF, so I'm super excited to be on this podcast. So thanks for having me. Yeah, MLT is a Japan-based nonprofit organization and our kind of core mission is to democratize machine learning. So we want to make machine learning and deep learning as accessible as possible to as many people as possible, because we believe that, you know, machine learning is going to be everywhere, it's going to be some standard component in the software stack in the very near future. So I think a lot of people should know what it is and be able to navigate it. And we mainly do this through open education, through open source. So we build a lot of open source projects, and open science, so we work with universities. And yeah, we are here in Tokyo and we support a research and engineering community of, I think, about four and a half thousand members. Oh, four and a half thousand? And so how does it work? Like, how do people join the community and what do they do? So it depends. Like, there are many ways to join the community.
You can just be an attendee of the meetups or join a workshops or Hands-On sessions, and then you can just join Meetup and you get all the information you need there on an upcoming sessions. But there's also like more active ways to join MLT. So if you want to contribute, if you want to work on open source or if you want to, for example, hold a workshop or lead a study session, you can join slack and talk to me. And there's like many ways how to how to be more actively involved in the community. What inspired you to start MLT? So we started, I think two and a half years ago and it was basically just out of our own needs. We were two people and that's how MLT started. And so I'm a domain expert in Machine learning, I come from a very traditional academic background and I'm a trained linguist and I was always working with text analysis and NLP. I was using very simple methods. At some point during my Masters, I was working on sentiment and emotion and effect and I realized that these kind of very simple statistical methods give us like some intuition, some insight about a corpus, about a data set. But language is full of very complex and very beautiful things like metaphors and humor and analogies and irony and sarcasm. And you know, that's not possible to grasp with those very simple tools. So I think three or four years ago, I started reading about Machine learning and Deep learning and Neural networks and I got super hooked and I realized, OK, having learning algorithms, having algorithms that learn from data directly instead of from rules or lexicons might be a way to understand language better or to be able to process language better. So I started writing my first machine learning code three years ago, but I also realized, well, coming from a different background, it's pretty challenging. It's pretty difficult. And for me back then I knew, OK, I want to have this collaborative learning environment. I need to be surrounded by people with different backgrounds, people that have, you know, different skills and know different things than I do. And together or at least that was that was what I thought we could learn faster. And that's exactly what happened. So Yuvraaj, my co-founder; is also coming from a different background, from an electrical engineering hardware background. And he wanted to use machine learning and he still wants to use it for EDGE devices, Micro-controllers. And yeah, we started very small and we just met every week and wrote machine learning code. And every week, more and more people joined, even though it was kind of word of mouth. And after like a few weeks, there were so many people, we didn't know where to put them anymore. So we met in this open co-working space at Yahoo!; and too many people! So everybody wanted to write machine learning code. And then we started putting out our first meetups. And ever since it has been growing pretty fast. So we started from very small, but kind of, you know, out of our own need to, because in Tokyo there was no such thing back then like two and a half years ago, there were a lot of communities like great communities. But there was no like place to actually build AI, there was no place to work on hands on stuffs. So that's how it all started. That's call you built the community that you wanted to be a part of. That's so great. (smile)Yeah. How did you frame it? Like when you were first saying, hey, come join me? What was that thing to do like it was I got to learn ML together or read papers or how did you think about that? 
So the very first kind of, I think first six months or so, it was purely dedicated to going through tutorials. So really learning about how to write machine learning code and learning about the, you know, getting a conceptual understanding of different algorithms of the math, but mainly to write code. And that was how we started. It's just, you know, going through as much stuff as possible. And then once we kind of and, you know, the team grew bigger and more people have joined us. So, after six months, you kind of slowly started to broaden. So we did a lot more things. We did started doing Hands-On Deep Learning workshops in the first year. So, we had deep learning engineers who were working as full time at the Japanese companies and they were giving 5(five) hour deep learning workshops where we focus on writing life code from scratch and training a specific model training at in a week. We first focused a lot of computer vision. So we went through a lot of computer vision stuff and then gradually kind of moved into different areas of machine learning. And like as the community kind of progresses and grows, we see that we go into different directions. So now we have like a computer vision team that does CNN architectures and their own little ecosystem. We have a team that is fully dedicated to AGI, so running deep learning algorithms on hardware and micro-controllers and EDGE devices. We have a NLP team that does research in natural language processing. And everything is fully community driven. So there is no full time employees or anything. It's really how the community evolves and grows and that kind of broadens into different directions. That's so impressive. So like, how do you run a good workshop? Like, a five hour workshop? You know, I've seen really good ones and bad ones. Like, how would you do it to make sure that it a good experience for people? I think it was learning by doing then; in the beginning, we really didn't know what we were doing. So I think two years ago, where we held our first deep learning workshops, a lot of things were pretty difficult and pretty challenging because, you know, people come with different machines, with different skill sets, with different background knowledge, with different software and hardware. So that's pretty difficult. But, kind of slowly we got a lot of feedback in kind of first iterations and worked with that feedback. So things that made it easier for us is just, you know, focus on one thing that is really interesting to us, where we see value that can bring value to us as instructors, as deep learning engineers, as well as student communities for something that is very useful. The second thing, is like make sure that technically everything runs smoothly. So we switched, I think after a second or third workshop to Google collapsed. That makes it very easy, like to just write code and there is no like prerequisite except for having a G-mail account. But that solves a lot of the technical issues and problems that we had. Yeah. But does everybody like build the same thing together, is that how you run it? It's like you get your sort of, say, a problem that really works together. Like how much do you kind of coordinate like everybody doing the exact same thing when it's like people going off on their own. So it depends what kind of workshop we're doing. So if we have our standard deep learning workshop, there's typically a topic and we already have prepared like a repository with the model that we're going to build. 
We sit down with 50 people and then the instructor. We do some theories. So we do first like maybe an hour of conceptual understanding of what is going to happen, where we're going to build. And then Demetries, for example, he is like coding from scratch. So he basically walks you through from the very beginning to getting your performance metrics. And so these kind of workshops are designed to do exactly this,only this. And people just follow along with the code and they can life code from scratch. And this is something that people find really useful because especially that kind of life coding aspect, because sometimes when you're on your own, you look at, you know, blocks of code and you kind of trying to figure out what is happening. Try to figure out your own thing. It's useful if someone actually writes code with you and explains what is happening. It's you learn just faster probably. Or this is at least what I find to be useful. On the other hand, we have much more open sessions. So especially like our hardware sessions where we do AGI, the only thing we provide is a ton of hardware and then people come in. These are typically smaller groups, maybe 20-25 people, and then people come in. They build teams, they choose their hardware and they come up with their own idea and they build their own stuff. And then at the end of the day, each team presents what they have been working on. So it really kind of depends on the session, I guess. Do you think is there like a different kind of culture in Japan than, say, in San Francisco? Like are language barriers like an issue at all? What it's like to be sort of? I guess, I know what it's like to be in San Francisco, but what do you think that there's big differences coming from Japan? I don't know San Francisco that well. So I was and I went to a lot of meetups actually, and they're pretty cool. I think a lot more things are just happening in San Francisco. And I think a lot more things are supported probably in S.F. In Japan, language is definitely an issue, it's a huge barrier. It is something that I've been constantly thinking about. In Japan, there are amazing communities in machine learning; there are two super big machine learning communities, one is TensorFlow user group that is very related, of course, like to Google and then Deep Lab, which is I think affiliated with Microsoft. Those guys are very big and the very, very Japanese; so everything is in Japanese. And then there's us. I think we're like similar in size but we speak in English. And yes, this is one thing that has been bothering me so much because I'm always trying to find ways how to not have these isolated communities. So this is a challenge in Japan. This is definitely a challenge and we're working on it. But other than that, you see that communities are growing and there is a huge demand also for a machine learning talent. So, yeah, apart from the kind of very Japan specific problems like language barriers, I think it's a pretty, pretty good and active environment to be in. Yeah, remember, I went to Japan last year and I've worked out on off and on with Japan as a market, and I've always been impressed by how excited people are about, you know, machinery giving going back like 10-15 years seem like there's a lot of enthusiasm for it. And actually, I've been kind of wrestling. I just would like to find a way to translate our documentation into Japanese and kind of keep it up to date. Yeah. I've been thinking about that lately. Yeah, I think that would be a good move. 
We were also like only focusing on English, but there needs to be like this bridge and we need to start somewhere. So we also started translating. We worked with a T.A.(Teaching Assistant) from Stanford to translate their 'C' as deep learning course material, of course notes into Japanese to make it more accessible to people and have bilingual kind of resources for people. So we're trying also very hard kind of to include as many people as possible. That's awesome. What do you think when you think about democratization of A.I.? I mean, what else do you think is important? Like how do you think about that? Maybe, because of my personal background, because I am a domain expert, but I also see how important machine learning is and is going to be in the near future. I feel like, if possible, we should have as many people as possible involved in even technical stuff. So there have been a lot of democratization efforts, for example, If you look at H2O with AutoML, like making it really very easy to experiment, but also of course from other AutoML platforms from tech giants. For us, it's like a lot of education that we do. We work with a lot of universities, something that I kind of personally like doing is working with research scientists or students coming from different backgrounds. So I think, machine learning could be super useful for people that work with a lot of data. And we worked with a lot of super interesting people. For example, last year in summer, I think we were at the Tokyo Institute of Technology where we held a two day boot camp for Elsie; Elsie is the Earth Life Sciences Institute. And those guys are amazing. They are astrophysicists, the planetary sciences by all computational biologists, chemists like...you know, mind blowing stuff! And we had a room full of people and they all work with different kinds of data sets and problem sets and with different tools and techniques. And machine learning could be one way for them to get new insights and maybe even to advance science. So, personally for me these kinds of things are super exciting, getting like more domain experts involved into technical stuff, doing open education, doing open science; this has been pretty interesting. What about people without math or programming background? Do you think there's room for them to contribute, too? Yeah, absolutely. I think so. You know, there are Jeremy Howard and Rachel, they've been doing like the best job ever into getting domain experts on board. Right? You do have to have some coding backgrounds, so you should be able to write some python code. But going through fast AI courses, for example, it's a more top down approach. And they're exactly democratizing machine learning or making it uncool by having so much more people just involved. And this top down approach allows you to get into deep learning without having to have a PhD at Stanford and computer science or like a really strong math background. You build stuff, so you start with thinking about your problem and your data and to build stuff then afterwards you start digging deeper into the math, for example, that you might need for your particular project or problem. And I think I really like this kind of approach. That's very similar to what we've been doing with MLT as well. Even though we also do a lot of like fundamental work. So we also have like study sessions for machine learning, math and other things. But I think there's definitely room for people who are coming from different backgrounds. 
And I think if they find it even potentially useful, they should look into it. I mean, you've probably seen people go from kind of novices to knowing a lot about ML. And people ask me all the time, how do I get into this stuff? Do you like have any advice from the data that you're saying? And you know what folks should do if you have no background and you really want to go deep on this stuff? Yeah, for sure. So I think two things are super important. The first thing is, don't neglect your background, don't think that you have to start over from zero and you don't know anything before that. Leverage your background, leverage your professional experience, your academic background, whatever it is that you have been working on in the past years, leverage that. It's the same, there are many examples for that. For example, you could be a hardware engineer and you know a lot about hardware; and now you're getting into Machine learning and Deep learning. Now leverage that background and that expertise to learn about machine learning and learn how to combine these two things. In my case, it's language. So I've studied language as a system for many, many years and I use machine learning, and the combination of language and machine learning to bring interesting and unique insights to particular project that I'm working on. I talked to a recruiter here in Japan and I asked him, so what does the market need? And he said, well, it's here in Japan, it's not enough just to know Deep learning. You have to have some sort of specialization, you have to have some sort of domain expertise, some like way how you can use this kind of Deep Learning in combination with something else. It could be software engineering, it could be hardware, it could be language, it could be anything. So this is the one thing. And the second thing is when you're coming from a different background and you want to go into machine learning, there is, of course, like two approaches. Either you start with the fundamentals, you start with math or you do what I just earlier mentioned top down. You start with a project and you just write code, build that project and then figure out details later. And I think the most important thing here is to figure out what is interesting to you, what would be something that really catches your attention and you love working on and make that decision and then start working on that. Because, the problem here is that, there are too many options. You could do too many things. Everything seems to be interesting. But if you spend a little time here and little time there, you will get maybe some shallow understanding of a few things, but you'll not advance as quickly as you might want. So figure out what you want to do and leverage your backgrounds- probably my advice. Do you think that you see people being more successful starting from the fundamentals or starting with a project? Because you mentioned that there's sort of two different approaches and people gravitate towards one or the other. Do you have a preference or can both work? Both can definitely work. I think we were just like only talking about domain experts and people coming from different backgrounds. But of course, I think what research for the academia and industry needs just as much or even more is people with very, very strong CS backgrounds, with very strong math backgrounds that know how to optimize and know how to work on theoretical things. So I'm not saying this is not important, not at all. 
I think, of course, this is still the norm and this is what probably employers want to see the most. And if you're coming from a strong cs or math background, I think you already have a strong foundation to go very deep into machine learning and deep learning. But I just want to say, like there's room for other people as well. Ok, so this is a little bit outside of the scope maybe of a ML podcast, but I am just fascinated by this, so maybe it is. What about starting a community? Do you have advice on someone in an area like you where they want to find like-minded people? I mean, do you have any advice on that, like if I'm in a city where there isn't already like an ML group, how would you go about finding people? Yeah. And like so many people write me messages on that. Oh, yeah(smile)? Yeah That's so great! They are either in remote areas or in cities where like literally something like a machine learning community still doesn't exist. And I would always say like, 'go for it'! If there is no such thing out there, be the first one to do it. Because, MLT has evolved into an amazing community. Like, literally I'm amazed by how active and how engaged communities and all those guys they have; they have full time jobs, but they still find time to work on open source and to teach other people and to do these kind of workshops. So it's pretty amazing. So I would really kind of suggest to think about starting a community wherever you are. Do you have any practice(*) for getting people off the ground? Because, it seems kind of daunting to me to try to start that and keep people engaged. How do you get people to keep talking? I think the most important thing is to do something that you are really interested in, because if you're starting it, a lot of things will depend on you. And the key is, I think somebody wrote it on Twitter recently; the key also to MLT is consistency. So we consistently just keep doing stuff that we think is exciting and interesting. So start from what you're interested in, start from your own problem set or from your own need, and more people will follow. And then like more practical things like, we started doing remote meet up so there was no burden of having to find a venue and a sponsor and other things. So this is an option how to kick things off to find more people who are interested; that makes it very easy, like there is no easier way probably then to start like remote meet ups. On the other hand, if you want to start something in your city, first of all, you might want form a small peer group around yourself and try to figure out what you want to do and then start to look for a physical place and figure out if you want to do hands on stuff or if you want to do like more educational stuff; learn together and try to reach as many people as possible. And, you know, just yesterday I talked to someone, a journalist, and he said to me, wow, there's no such thing for writers out there. I want to start something for writers out there. And I think it's kind of the same thing. Right?There is a need for all these niche groups and communities. So I think if you get it out there and if you do things that you're very passionate about, people will follow. Do you have any thoughts on like diversity and any inclusion in ML in these groups that you create? Is that something that is top of mind for you? Yeah. That's something that is very important to us. 
Luckily within MLT, we're a very diverse kind of, four and a half thousand people in terms of countries and languages and skill sets and backgrounds and professional experiences. So this is really super diverse. But, yeah, women are super underrepresented. I think two years ago when we started on working on deep learning workshops, we had 60 engineers and I was the only woman. Wow. Yeah, so I realized we really needed to do something about that. So we're doing like very specific, not only events but also projects that support diversity and inclusion. We do a lot of women in machine learning events, they are supported by Google Japan, Mercari and other companies. We also do projects that I just earlier mentioned where in one of them, we had about 12 bilingual engineers that worked on translating some of the Stanford course notes into Japanese and having this kind of bilingual resources for people just to be more inclusive in general, also to the Japanese community because we are literally in Japan and we are very diverse. But it's still seems like there's a disconnection between a Japanese community and an English speaking community. And I think it has never been more important. We all know, Tech in general is multidisciplinary so machine learning should be as well multidisciplinary. We need people with different skills, with different expertise, with different backgrounds in general. So, this is something we all have to work on, I think. Do you have any other suggestions for making the community feel more inclusive? So in our core team, we decided very early on that we want to create an environment that is very collaborative and that is very inclusive. That means that we really don't want this as kind of elite math machine learning group. We want to include as many people as possible and we want to have like decision processes. We want to have the community involved in what directions we take, what kind of things where we're tackling on next. And every project that we do, in every workshop and every study session, we kind of have that sort of mindset. So when you look at our math sessions, like last year we started doing remote math reading sessions. So we're going through a book that walks you through some machine learning math and more than 1000 people signed up from all over the world. Wow We have sessions in the Bay Area, we have sessions in India and Apack here in Japan. And the thing is, it is very inclusive because the sessions, the people that join those sessions, their levels of math are very different; so we have complete beginners, we have people that are coming from completely different backgrounds. But in our Tokyo sessions, we also have mathematicians, we have experts, we have PhD's in math, people that have taught math for many years. And it's pretty amazing, after the reading it's a very interactive discussion where people ask all sorts of questions and together we kind of brainstorm around things and our experts like Emil and Jason, they try to explain mathematical concepts. And it's been pretty amazing. So I think really having this mindset that whatever you do, you need this. It's something that is actually enriching to whatever you do and that is very important and having that mindset is probably going to help a lot. Super cool. That sounds really fun. What is something underrated maybe in machine learning that you think people don't pay enough attention to? Something underrated? So, I think something still underrated in machine learning is data. Still?! Oh my God! 
Yeah, I think so. Like, it doesn't matter who I talk to, always I feel like it's a troublesome thing to do. Right. You don't want to work with data, you want to write machine learning algorithms, you want to train models, you want to get good accuracy and then push accuracy or metric. It's not about data. So data is kind of doilies that people think about sometimes or this is at least kind of my understanding of it. And I think we should definitely think more about data and put more emphasis on data. Maybe this is also because of my own background, because I've been working with data pretty much all my career and just three years ago started working with machine learning algorithms. But yeah, it all starts with data and it'll probably ends with data. I think Chip Huyen just mentioned recently, who owns the data pipeline will own the machine learning and production or the machine learning system. So yeah, I guess; I don't know if it's still that case. Maybe in SF, maybe in the Bay Area people think more about data. I don't know. I don't know. I think my background is similar to yours. And so I feel like data is so unbelievably important, I guess. Right, Yeah. It's not possible for Data to be properly rated for its contribution to ML. And then when you think about making machine learning work in the real world for real applications, what's the hardest part about getting it to work? So in our case, we love to experiment with new things. And I think, it's difficult when you're trying new things, you kind of need to figure out a lot of stuff. And generally, I think in production environments, there is a lot of experimenting and trying to see what works. So making a production pipeline work and deploying machine learning for different use cases has different challenges, from data to all the way to software engineering to monitoring your model like how it changes in different real-world scenarios. So even though like things are taking off, there's still a lot of room to work on these kinds of things, infrastructure things, deployment thing, finding new use cases, finding use cases that make a lot of sense for machine learning. But at the same time, I think this is super exciting, so this is something that really excites me probably the most is thinking about use cases and experimenting a lot and trying new things. At MLT, we do work on production things as well, but it's not our main thing. Our main thing is just trying out new things, experimenting and make POC (Proof of Concept), so we don't actually deploy a lot of things on a large scale to production. So maybe I can't talk about like the main challenges here, but what I can say is that we try like if we take EDGE, for example, we're trying out a lot of things. We're working with different hardware where we're trying to think about different use cases where these things can be deployed. And there is a lot of things that just don't work out and fail but that's totally fine. That's good as well. This is something that we also need to grow and to figure out things. But then at the same time, we also build things that work and that are super interesting. So, yeah, it's a lot of experimenting, I guess. Yeah. OK, so my final question. If listening to you talk I get excited about joining one of your virtual ventures, how do I find out more? How do I get more involved with MLT? Can I do that remotely? Yes, you can, definitely. So we do, as I mentioned earlier, like on Meetup, you can find all of our events and a lot of them are actually remote. 
So if you want to be part of an event or a meetup or something like that, you can just join meetup and we will post everything there. There's also more active things. So if you would like to work on open source or doing some other things or get more involved in general, you can join our slack group. There is pretty much the whole community, they are talking about different things, so probably more in technical depth. So you can also find people there to work on projects and do other things. And so these are kind of the main two things, the meet up for events and maybe slack for projects and other stuff. Awesome. Thank you so much. It's great to talk with you. Yeah. Thank you so much for having me.",5647 +Jeremy Howard — The Story of fast.ai and Why Python Is Not the Future of ML,https://www.youtube.com/watch?v=t2V2kf2gNnI,3069,2020-08-25,[Music] you're listening to gradient descent a show where we learn about making machine learning models work in the real world i'm your host lucas bewall jeremy howard created the fastai course which is maybe the most popular course to learn machine learning and there are a lot out there he's also the author of the book deep learning for coders with fastai and pytorch and in that process he made the fastai library which lots of people use independently to write deep learning code before that he was the ceo and co-founder of analytic an exciting startup that applies deep learning to healthcare applications and before that he was the president of kaggle one of the most exciting earliest machine learning companies i'm super excited to talk to him so jeremy it's nice to talk to you and in preparing the questions i kind of realized that um every time i've talked to you there's been kind of a few gems that i've remembered that i would never think to ask about like one time you told me about how you learned chinese and another time you gave me um dad parenting advice like very specific advice that's been actually super um helpful so it was kind of funny hey tell me what what dad parenting advice worked out well what you told me was um when you change diapers use a blow dryer to change a um a really frustrating experience into like a really joyful experience and it's like such good advice i don't know how you i guess i can imagine how you thought of it but it's yeah yeah no they love the whooshing sound they love the warmth i'm kind of obsessed about dad things so i'm always happy to talk about bad things that is this podcast can we start with that now now that my daughter's eight months old do you have any any suggestions for this oh my goodness eight months old you know it's like the same with any kind of learning it's all about consistency so i think that the main thing we did right with claire was just you know this delightful child now is we were just super consistent like if we said like you can't have x unless you do y we would never do x you know give her x if you didn't do y and if we're like if you want to take your scooter down to the bottom of the road you have to carry it back up again we read this great book that was saying like if you're not consistent it becomes like this thing like it's like a gambler it's like sometimes you get the thing you want so you just have to keep trying so that's my number one piece of advice it's the same with like teaching machine learning we always tell people that tenacity is the most important thing for a student it's like to stick stick with it do it every day i guess just in the spirit of questions i'm genuinely um 
curious about you know you've built this um you know kind of amazing framework and and sort of teaching thing that i think is maybe the most popular and most appreciated framework i was wondering if you could you could start by telling me the story of what inspired you to do that and what was the the kind of journey to making you know fast ai the curriculum and fast ai the yeah ml framework so um it was something that my wife rachel and i started together um and um so rachel has a math phd super technical background early data scientists and engineered uber i don't you know i i have a just scraped by a philosophy undergrad and have no technical background but you know from both of our different directions we both had this frustration that like neural networks in 2012 super important clearly gonna change the world but super inaccessible and you know so we would go to meetups and try to figure out like how do we like i knew the basic idea i'd coded neural networks 20 years ago but like how do you make them really good there wasn't any kind of open source software at the time for running on gpus you know dan sirisson and jurgen schmidt who his thing was available but he had to pay for it there was no source code and we just thought oh we've got to change this because the history of technology leaps has been that it generally increases inequality because the people with resources can access the new technology and then that leads to kind of societal upheaval and a lot of unhappiness so we thought well we should just do what we can so we thought how how are we going to fix this and so basically the goal was and still is be able to use deep learning without requiring any code so that you know because the vast majority of the world can't code um we kind of thought well to get there we should first of all see like well what exists right now learn how to use it as best as we can ourselves teach people how to best use it as we can and then make it better which requires doing research and then turning that into software and then changing the course to teach the hopefully slightly easier version and repeat that again and again for a few years um and so that's we're kind of in that process that's so interesting do you worry that um the stuff you're teaching you're sort of trying to make it obsolete right because you're trying to build higher level abstractions like i think one of the things that people really appreciate your course is the sort of really clear in-depth explanations of how these things work do you think that that's eventually going to be not necessary or how do you think about that yeah um to some extent i mean so if you look at the the the new book and the new course um the the chapter one starts with like really really foundational stuff around like what is a machine learning algorithm what what do we mean to learn an algorithm what's the difference between traditional programming and machine learning to solve the same problem and those kinds of basic basic foundations i think will always be useful even at the point you're not using any code i feel like even right now if somebody's using like platform ai or some kind of code free framework you still need to understand these basics of like okay an algorithm can only learn based on the data you provide you know it's generally not going to be able to extrapolate to patterns it's not seen yet stuff like that but yeah i mean um we have so far released two new courses every year you know a part one and a part two every year because every year 
it's totally out of date and we always say to our students at the start of part one look you know none of the details you're learning are going to be of any use in the year or two's time there's a good you know when we're doing thiano and then tensorflow and keras you know and then playing pie torch we always say look don't worry too much about the software we're using because none of it's still any good you know it's goal changing rapidly you know faster than javascript frameworks but the concepts are important and yeah you can pick up a new library in i don't know awake i guess do you um it seems like you've uh you've thought pretty deeply about um learning both you know human learning and and machine learning had you had um had you or rachel had practice teaching before was this kind of your first teaching experience um you know i've actually had a lot of practice teaching of this kind but in this really informal way partly it's because i don't have a technical educational background myself so i found it very easy to empathize with people who don't know what's going on because i don't know what's going on and so way back when i was doing management consulting you know 25 years ago i was always using data driven approaches rather than expertise and interview driven approaches to solve problems because i didn't have any expertise and i couldn't really interview people because nobody took me seriously because they're too young so and so then i would like have to explain to my client and to the engagement manager like well i solved this problem using this thing called linear programming or multiple regression or a database or whatever and yeah what i found was i very i wouldn't say very quickly but within a couple of years in consulting i started finding myself like running training programs for what we would today call data science but 20 something years before we were using that word yeah basically teaching our client and uh you know so when i was at eighty carnie i ran a course for the whole company basically that every uh associate nba had to do in what we would today call data science you know a bit of sql a bit of regression a bit of spreadsheets bit of monte carlo so yeah i've actually done quite a lot of that now you mention it and uh certainly rachel also um uh but uh for her on um pure math you know so she she ran some courses at duke university and stuff for post grads so yeah i guess we both had some some practice and we're pretty passionate about it so we also um study the literature of how to teach a lot which most teachers weirdly enough don't so so that's good do you have do you feel like um there are things that you feel like uniquely proud of in in your teaching or like things that you're doing particularly well compared to um you know other classes that people might take yeah i mean that i wouldn't say unique because there's always other people doing good stuff you know i think we're notable for two things in particular one is um code first and the other is top down so you know i make a very conscious decision in kind of everything i do to focus on myself as the audience i'm not a good mathematician you know i'm like i'm i'm i'm capable nowadays but it's not something that's really in my in my background and doesn't come naturally to me for me the best explanation of a technical thing is like an example in some code that i can run debug look at the intermediate inputs and outputs so so i make a conscious decision in my teaching to to teach to people who are like me and 
although most people at kind of graduate level in technical degrees and not like me they've all done a lot of math most people that are interested in this material are like me they're people who don't have graduate degrees and they're really underrepresented in the teaching group because like nearly all teachers are academics and so they can't empathize with people who don't love greek letters you know and integrals and stuff so yeah so so it's so i always explain things by showing code examples so and then the other is top down which is again the vast majority of humans not necessarily the vast majority of people who have spent a long time in technical degrees and made it all the way to being professors but most regular people learn much better when they have context why are you learning this what's an example of it being applied you know what are some of the pros and cons of using this approach before you start talking about you know the details of how it's put together so and we this is really hard to do but we try to make it so that every time we introduce a topic it's because we kind of need to show it in order to explain something else or in order to improve something else and this is so hard because obviously everything i'm teaching is stuff that i know really well and so it's really easy for me to just say like okay you start here and you build on this and you build on this and you build on this and here you are and that's that's just the natural way to try to teach something but it's not the natural way to to learn it so i i don't think people realize how difficult top-down teaching is but um people seem to really appreciate it yeah they do seem to really appreciate it do you think um i mean love to talk to rachel about this directly but do you think rachel has the same approach as you because it sounds like she has a pretty different background yeah she does have a different background um but she certainly has the same approach because we've talked about it and she we both kind of kind of jump on each other to say like hey you know because we kind of do a lot of development together or we did before she got onto the data ethics stuff more um and sometimes you know i'll say to her like hey that seems pretty bottom-up don't you think and she'll be like oh yeah it is damn it start again you know so we both know it's important and we both try really hard to do it but we don't always succeed and can you tell me about the um the library that you built like how that came about do you think it was necessary to do it to teach the way you wanted to well it's not it's remember the purpose of this is not teaching so we want there to be no teaching so the goal is that they're all minimal teaching the goal is that there should be no code and it should be something you can pick up in half an hour and get going so the fact that we have to teach what ends up being about 140 hours of work is a failure you know we're still failing um and so the only way to to fix that is to create software which makes everything dramatically easier um so really the software is that's actually the thing that's actually our our goal um but we we can't get there until you know we first of all teach people to use what already exists and to do the research to figure out like well why is it still hard why is it still too slow why does it still take too much compute why does it still take too much data like what are all the things that limit accessibility do the research to try and improve each of those things a little 
bit okay how can we kind of embed that into software yeah the software is kind of the end result of this i mean it's still a loop but eventually hopefully it'll all be in the software and i guess we've gotten to a point now where we feel like we understood some of the key missing things in deep learning libraries at least we're still a long way away from being no code but we at least saw things like oh you know basic object-oriented design is basically is largely impossible because tensors don't have any kind of semantic types so let's add that and see where it takes us and you know kind of stuff like that we really tried to get back to the foundations were there any other ones that was a that was a good a good one any any other that come to mind yeah i mean um you know i mean dispatch is a key one so the fact that um kind of julia style dispatch is not built into python um so function dispatch on typed arguments we kind of felt like we had to fix that because really in for data science the kind of data you have impacts what has to happen and so if you say rotate then depending on whether it's a a 3d ct scan or an image or a point cloud or a set of key points for a human pose rotate semantically means the same thing but requires different implementations um so yeah we built this kind of julia inspired type dispatch system also like realizing that to go with again it's really all about types i guess when you have semantic types they need to go all the way in and out by which i mean you put an image in it's a pillow you know image object it needs to come all the way out the other side as you know an image tensor go into your model the model then needs to produce an image you know uh an image tensor or a category you know type or whatever and then that needs to come out all the way the other side to be able to be displayed on your screen correctly so we had to make sure that the entire transformation pipeline was reversible so we had to set up a new system of um reversible composable transforms um so this stuff is all like we as much as possible we try to hide it behind the scenes but without these things our eventual goal of no code would be impossible because um you know you would have to tell the computer like oh this tensor that's come out actually represents you know three bounding boxes along with associated um categories you know and describe how to display it and stuff so it's all pretty foundational to both making the process of coding easy and then down the track over the next couple of years you know removing the need for the code entirely and what did you um like what was the big goal behind releasing a v2 of the library that was kind of a bold choice right to to just make a complete rewrite yeah i'm um you know i'm a big fan of second system you know the kind of the opposite of of joel spolsky you know i i i love rewriting i'm more i mean i'm no arthur whitney but you know arthur whitney who created k and kdb um uh every version he rewrites the entire thing from scratch um and he's done many versions now um that's that's i really like that as a general approach which is like if i haven't learned so much that my previous version seems like ridiculously naive and and pathetic then i'm i'm not moving forwards you know so i do find every year i look back at any code i've got and think like oh that could be so much better and then you rewrite it from scratch i did the same thing with the book you know i rewrote every chapter from scratch a second time so it's partly that and it's 
partly also just that it took a few years to get to a point where i felt like i i actually had some solid understanding of what was needed you know the kind of things i just described um and some of a lot of it came from like a lot of conversations with um chris lattner the the inventor of swift and llvm um so when we taught together um it was great sitting with him and we're talking about like porting ai to swift and like the type system in swift and then working with um alexis gallagher who's like maybe the world's foremost expert on the on swift's value type system and he helped us build a new data block api for swift and so kind of through that process as well it made me realize like yeah you know this is um this is actually a real lasting idea and actually i should mention it it goes back to the the very idea of the data block api which actually goes back to fastao version one which is um this idea that and again it's kind of based on really thinking carefully about the foundations which is like rather than have a a library which every possible combination of inputs and outputs ends up being this totally different class you know with a different api and different ideas let's have some types that represent that could be either an input or an output and then let's figure out the actual steps you need it's like okay you've you know how do you figure out what the input items are how do you figure out what the output items are how do you figure out how to split out the validation set how do you figure out how to get the labels um so again these things are just like yeah we you know came to them by stepping back and saying what is actually foundationally what's going on here and let's do it properly you know so fast ai too is really our first time where we just stepped back and you know literally i said um you know so silva and i worked on it and i said to silver like we're not gonna push out any piece of this until it's the absolute best we can make it you know right now um which i know silva i kind of got a bit you know filled i was a bit crazy sometimes like the the transformed api transforms api i think i went through like 27 rewrites um but you know i kept thinking like no this is not good enough no this is not good enough you know um until eventually it's like okay this is this is actually good now so is the hardest part the um the external apis then because that does seem like it'd be really tricky to to make that i mean that seems like an endless task to make these apis like clear enough and organized well they're never um i never think of them as external apis to me they're always internal apis they're what i mean because you want to make a bigger system yeah what am i building the rest of the software with exactly and you know we went all the way back to like thinking like well how do we even write software you know i'm a huge fan i've always been a huge fan of the idea of literate programming but never found anything that made it work and you know we've been big proponents of jupiter notebook forever um and it was always upsetting to me that i had this like jupiter world that i loved being in and this like ide world which i didn't have the same ability to explore in a documented reproducible way and incorporate that exploration and explanation into the code as i wrote so yeah we went all the way back and said like oh i wonder if there's a way to actually use jupyter notebooks to create an integrated system of documentation and code and tests and exploration um and it turns out 
So it's really just going right back, at every point where I felt like I'm less than entirely happy with the way I'm doing something right now, and saying, okay, can we fix that, can we make it better? And Python really helped there, because Python is so hackable: the fact that you can actually go into the meta-object system and change how type dispatch works and change how inheritance works. Our type dispatch system has its own inheritance implementation built into it. It's amazing you can do that. Wow, why? Because the type dispatch system needs to understand inheritance when it comes to deciding: if you call a function on types A and B, and there's something registered for that function which has some superclass of A and some higher superclass of B, and something else with a slightly different combination, how do you decide which one matches? In the first version of it I ignored inheritance entirely, and it would only dispatch if the types exactly matched or one of the types was None. But later on I added inheritance, so now you've got this nice combination of multiple dispatch and inheritance, which is really convenient. Can you give me some examples of how the inheritance works with your types? Because I would think it could get kind of tricky, like what's even inheriting from what? The types that quickly come to mind for me: if you have an image with multiple bounding boxes, would that inherit from just a raw image? Generally those kinds of things will compose. I don't think we ever use multiple inheritance; I try to stay away from it because I've always found it a bit hairy, so instead things tend to be a lot more functional. A black-and-white image inherits from image, and I think a DICOM image, which is a medical image, also inherits from image. Then there are transforms with type signatures that will take an image, and there will be others that will take a DICOM image. So if you call something with a DICOM image for which there isn't a registered function that takes a DICOM image, but there is one that takes an image, it'll call the image one. And then we also use a lot of duck typing, so there'll be a call to .method, and .method can be implemented differently in the various image subclasses. The other thing you can do with our type dispatch system is use a tuple of types, which means that function argument can be any of those types, so you can create union types on the fly, which is pretty convenient too. Are there parts of v2 that you're still not happy with, or were you really able to realize that vision? There are still some parts, yeah. Partly that happened because of COVID; I unfortunately found myself the kind of face of the global masks movement, which didn't leave much room for more interesting things like deep learning. So some of the things that we added towards the end, like some of the stuff around inference, are still possibly a little clunky. But it's only some little pieces; on the whole, inference is pretty good. For example, I didn't really look at all at how things would work with ONNX, so kind of mobile or highly scalable serving.
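Since ONNX export comes up as one of the things not yet looked at, here is a hedged sketch of how a trained PyTorch model could be exported for serving. The model and input shape are placeholders, not anything from fastai; torch.onnx.export is PyTorch's standard export entry point.

```python
# Hedged sketch: exporting a PyTorch model to ONNX for mobile or scalable serving.
# The model and input size below are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 2))
model.eval()

dummy = torch.randn(1, 3, 224, 224)        # example input with the expected shape
torch.onnx.export(
    model, dummy, 'model.onnx',
    input_names=['image'], output_names=['logits'],
    dynamic_axes={'image': {0: 'batch'}},  # allow a variable batch size
)
```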
Also, the training loop needs to be a little more flexible to handle things like the Hugging Face Transformers API, which makes different assumptions that don't quite fit ours. And TPU training: because of the way it runs on this separate machine that you don't have access to, you have to find ways to do things that accept really high latency. For TPUs it's particularly important, because we've built a whole new computer vision library that runs on the GPU, or rather runs in PyTorch, which generally targets the GPU, and PyTorch has pretty good GPU launch latency along with a good NVIDIA driver. So we can do a lot of stuff on the GPU around transformations, and that all breaks down with TPUs, because every time you do another thing on the TPU you have to go through that whole nasty latency. So yeah, there are a few little things like that that need to be improved. Is it important to you that your library is used widely outside of a learning context? Is one of your goals to make it widespread in production systems? Yeah, yeah, because the learning context hopefully goes away eventually. Hopefully there will be no fast.ai course and it'll just be software. So if people are only using our software in a learning context, it won't be used at all. We want it used everywhere, or something like it. I don't care whether it's fastai or whether somebody else comes along and creates something better; we just want to make sure that deep learning is accessible. That's super important. And the funny thing is, because deep learning is so new and appeared so quickly, a lot of the decision makers, even commercially, are people who are highly academic, and the whole academic ecosystem is really important, much more so than in any other field I've ever been in. So one of the things we need to do is make sure that researchers are using fastai. We're researchers too, so we try to make it very researcher-friendly, and that's one of the key focuses at the moment. Sorry, I would think, just naively, that making something research-friendly would involve kind of the opposite of a single clean API, or of abstracting away all the details. I would think researchers would want to really tinker with the low-level assumptions. Yeah, well, that's why you need a layered API. The first thing to realize is that it's getting to the point now, or maybe it's at the point now, where most researchers doing research with deep learning are not deep learning researchers. They're proteomics researchers or genomics researchers or animal husbandry researchers or astrophysicists. You have not heard that I was the keynote speaker, a couple of years ago, at the major international animal husbandry congress? I got a nice trip to Auckland with the family; it was very pleasant. In fact, Hadley Wickham's father organized it and he invited me. Well, I'm sorry I cut you off, you were making an interesting point that I interrupted for no reason. I didn't know that you were so ignorant about animal husbandry, Lukas. I'm disgusted. Dude, I love all the unusual use cases of deep learning, it's definitely something I collect, but I have not heard that one. Sorry, where were we?
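On the earlier point about doing transformations on the GPU in large batches so the per-item work is amortized over a single kernel launch, here is a hedged PyTorch sketch. The transform is a trivial horizontal flip chosen for illustration; the code falls back to CPU if no GPU is available.

```python
# Sketch: applying a data augmentation to a whole batch at once on the GPU,
# so one launch covers every item. Illustrative only.
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

batch = torch.rand(64, 3, 224, 224, device=device)   # a batch of images

def flip_batch(x):
    # One call flips every image in the batch; on a backend with high
    # per-call latency, paying that cost per item is the problem described.
    return torch.flip(x, dims=[-1])

augmented = flip_batch(batch)
print(augmented.shape, augmented.device)
```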
We were talking about... oh yeah, researchers. So you're doing research into a thing, right? Maybe you're trying to find a better way to do gradient accumulation for FP16 training, or maybe you're trying a new activation function, or maybe you're trying to find out whether this different way of handling four-channel input works well for hyperspectral satellite imagery, or whatever. The idea is to let you focus on that thing and not all the other things, but then you want all the other things to be done as well as possible, because if you do a shitty job of all the other things, then you might say, oh, my activation function's actually really good, but then somebody else might notice, oh no, it was just doing a kind of crappy version of data augmentation effectively, because if we add dropout then your thing doesn't help anymore. So with a layered API you can use the high-level, easiest bits with all the defaults that work nicely together, and then you just pick the bit that you want and delve in as deep as you like. There are really four key layers in our API. So maybe you'll go in and create a new data block, or maybe you'll go and create a new transform, or maybe you'll go and create a new callback. The thing about fastai is it's actually far more hackable than, say, Keras, to take what I'm very familiar with. With Keras you have this pretty well-defined transformation pipeline, or tf.data if you're using that, a pretty well-defined set of atomic units you can use, and if you want to customize them you're kind of out of luck; often it requires going and creating a new TF op in C++ or something. So it really helps using PyTorch: they provide these really nice low-latency primitives, and then we build everything out of those primitives, and we gradually layer the APIs on top of each other, and we make sure that they're very well documented all the way down. So you don't get to a point where it's like, oh, you're now in the internal API, good luck. It's like, no, it's all external API, and it's all documented, and it all has tests and examples and explanations, so you can put your research in at the point that you need it. I see. But I guess when you talk about researchers, you're imagining actual machine learning researchers researching machine learning itself, versus, say, an animal husbandry researcher who needs an application of machine learning. I guess you're speaking to both? Yeah, both. It's much easier for me to understand the needs of ML researchers, because that's what I do and that's who I generally hang out with, but there's a lot of overlap. I found, back in the days when we had conferences you could go to, as I walked around NeurIPS a lot of people would come up to me and say, oh, I just gave this talk, I just gave this poster presentation, and three years ago I was a fast.ai student. Before that I was a meteorologist or an astrophysicist or a neuroscientist or whatever, and I used your course to understand the subject, and then I used your software, and then I brought in these ideas from astrophysics or neuroscience or whatever, and now here I am presenting them at NeurIPS. So there's this really interesting overlap now between the worlds of ML research and domain expertise.
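To make the "pick the layer you need" idea concrete, here is my own minimal sketch of a callback-driven training loop with a gradient accumulation callback, in the spirit of the customization points mentioned above. It is not the fastai Callback API, just an illustration of the pattern.

```python
# Minimal sketch of a callback-driven training loop with gradient accumulation.
# Illustrative only; not the fastai Callback API.
import torch

class Callback:
    def before_step(self, step): return True   # return False to skip the optimizer step
    def after_step(self, step): pass

class GradientAccumulation(Callback):
    def __init__(self, every=4): self.every = every
    def before_step(self, step):
        # Only step the optimizer every `every` mini-batches; gradients
        # keep accumulating in between because zero_grad is also skipped.
        return (step + 1) % self.every == 0

def train(model, loss_fn, opt, batches, callbacks=()):
    for step, (xb, yb) in enumerate(batches):
        loss = loss_fn(model(xb), yb)
        loss.backward()
        if all(cb.before_step(step) for cb in callbacks):
            opt.step()
            opt.zero_grad()
        for cb in callbacks: cb.after_step(step)

# Usage with toy data:
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]
train(model, torch.nn.functional.mse_loss, opt, batches,
      callbacks=[GradientAccumulation(every=4)])
```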
Increasingly, domain experts are becoming pretty well-regarded and well-respected ML researchers as well, because you kind of have to be. If you want to do a real kick-ass job of medical imaging, for instance, there are still a lot of foundational questions you have to answer, like how do you actually deal with large 3D volumes? These things are not solved, and so you do have to become a really good deep learning researcher as well. I think one of the things that I always worry about for myself is getting out of date. I remember being in my early 20s and looking at some of the tenured professors who were my age now and thinking, boy, they've just not stayed current with the state of machine learning. And then I started a company and realized that I actually wasn't staying up to date myself, and was often stuck in older techniques I was more comfortable with, or languages I was more comfortable with. I feel like one of the things that you do just phenomenally well, at least from the outside, is staying really current and on top of stuff. I wonder if you have any thoughts on how you do that. Well, I've got to say I really admired what you did with moving away from your world of crowdsourcing into deep learning, and I think you took like a year or so just to figure it out, right? Not many people do that, and I think a lot of people assume they can't, because if you get to, I don't know, your mid-30s or whatever and you haven't learned a significant new domain for the last decade, you could easily believe that you're not capable of doing so. So I think you kind of have to do what you did, which is just to decide to do it. For me, I took a rather extreme decision when I was 18, which was to make sure I spent half of every day learning or practicing something new for the rest of my life, which I've stuck to, certainly on average. Nowadays it's more like 80 percent. It's weird, my brain still tells me I won't be able to understand this new thing, because I start reading something and I don't understand it straight away, and my brain's like, okay, this is too hard for you. So you have to push through that. But I had this realization as a teenager that learning new skills is this high-leverage activity, and I hypothesized that if you keep doing it for your whole life, and I noticed nobody I knew did, wouldn't you get these kind of exponential returns? So I thought I should try to do that. That's my approach. So you reasoned your way into that choice? That's amazing. Do you have to fight your immediate instincts to do that, or is it kind of a pleasure? My instincts are fine now. What I do have to do is to fight, well, not anymore, not now that I work with my wife, and I'm working with Sylvain, who's super understanding and understood me in a similar way. But for nearly all my working life, it was fighting, or at least dealing with, the people around me. Because particularly when you're the boss and you're like, okay, we urgently need to do X, and somebody can clearly see, like, why the hell are you using Julia for the first time to do X?
We don't even know Julia. You could have had it done already if you'd just used Perl or Python or something you already knew. And I was like, well, I just wanted to learn Julia. So yeah, it drives the people I'm working with crazy, because everybody's busy, and it's hard in the moment to appreciate that, okay, this moment isn't actually more important than every other moment for the rest of your life, and so if you don't spend time now getting better at your skills, then for the rest of your life you're going to be a little bit slower and a little bit less capable and a little bit less knowledgeable. So that's the hard bit. It also sounds to me, just from the examples you've given, that you have a real bias toward learning by doing. Is that right? Do you also read papers and synthesize them in a different way? Yeah, but if I read a paper, I only read it until I get to the point where I decide it's something I want to implement or not, or that there's some idea that I want to take away from it to implement. I find, I don't know, I'm a very intuitive person, so by doing things and experimenting a lot I get a sense of how things fit together. I really like the way Richard Feynman talked about his research. His understanding of papers was that he always thinks about a physical analogy every time he reads a paper, and he doesn't go any further on a paper until he has a physical analogy in mind, and then he always found that he could spot the errors in papers straight away by recognizing that the physical analogy would break down. I'm a bit like that; I'm always looking for the context and the understanding of what it's for, and then I try to implement it. I see. So should we expect the next version of fastai to be in a new language? Have you thought about moving away from Python? Oh, I mean, obviously I have, because I looked at Swift, and sadly Chris Lattner left Google, so I don't know. They've got some good folks still there, maybe they'll make something great of it, but I tend to follow people who have been successful many times, and Chris was one of those people. So what's next, I don't know. It's certainly the case that Python is not the future of machine learning; it can't be. It's so nicely hackable, but it's so frustrating to work with a language where you can't do anything fast enough unless you call out to some external CUDA or C code, and you can't run anything in parallel unless you put it on a whole other process. I find working with Python there's just so much overhead in my brain to try to get it to work fast enough. It's obviously fine for a lot of things, but not really in the deep learning world, or not really in the machine learning world. So I really hope that Julia is really successful, because there's a language with a nicely designed type system and a nicely designed dispatch system, and most importantly it's Julia all the way down. You can get in and write your GPU kernel in Julia; all the basic stuff is implemented in Julia, all the way down until you hit the LLVM. Sorry, this is an embarrassing question: Julia's kind of like MATLAB, is that what I should be thinking? It was designed to be something that MATLAB people could use, but no, it's more like, I don't know, like
Common Lisp meets MATLAB meets Python. So it sounds a little bit like R, maybe? R has some nice ideas, but the R object system, I mean, A, there are too many of them, B, they're all such a hack, and C, because it's so dynamic it's very slow. So again, you have to implement everything in something that's not R, and R just becomes a glue language on top of it. I spent so many years writing R, and it's certainly better than what came before, but I never enjoyed it. Julia is a compiled language, it's got a rich type system, it's entirely based on function dispatch using the type system, and it's got a very strong metaprogramming approach, so that's why you can write your CUDA kernel in Julia, for example. Its autograd, again, is written in Julia. So it's got a lot of nice features, but unfortunately it hasn't really got the corporate buy-in yet, so it's highly reliant on this core group of super smart people who started it and now run Julia Computing, which doesn't seem to have a business model as far as I can tell, other than to keep getting funding from VCs, which works for a while, but at some point it stops. What is, yes, I know, what is the fast.ai business model? Is there a business model? The fast.ai business model is that I take money out of my bank account to pay for things I need, and that's about it. Awesome. Well, you know we always end with two questions, and I want to make sure we have time for them, to have a little bit of consistency here. The first one is: when you look at the different topics in machine learning, broadly defined, is there a topic that you think people should pay a lot more attention to than they generally are? Yes, and I think it's the world of deep learning outside of the area that you're familiar with. For example, when I got started in NLP, I was shocked to discover that nobody I spoke to in the world of NLP had any familiarity with the last three or four years of development in computer vision, the idea of transfer learning, for example, and how incredibly flexible it was. That's what led to ULMFiT, which in turn led to GPT, which in turn led to GPT-2. Before ULMFiT happened, every NLP researcher I spoke to, I'd ask, what do you think about this idea of super massive transfer learning from language models, and everybody I spoke to in NLP said that's a stupid idea, and everybody I spoke to in computer vision said, yes, of course, I'm sure everybody does that already. So I think in general people are way too specialized in deep learning, and there are a lot of good ideas in other parts of it. Interesting, cool. And then our final question we always ask, and I kind of wonder, you'll have an interesting perspective on this. Typically we're talking to people who are trying to use a machine learning model for some purpose, like animal husbandry, but you've seen this wide range of applications. When you look across the things that you've seen go from ideation to a deployed thing that's working and useful, where do you see the biggest bottleneck? I mean, the projects I've been involved in throughout my life around machine learning have always been successfully deployed, so I get frustrated with all these people who tell me that machine learning is just this abstract thing that no one's
actually using. I think a big part of the problem is there are people who understand business and logistics and process management, and there are people who understand AI and algorithms and data, and there's not much connectivity between the two. I spent 10 years working as a management consultant, so all my life was logistics and business processes and HR and all that stuff. It's kind of hard to picture you as a management consultant; I think you must have been a surprising consultant. I tried to fake it as best as I could, for sure. I've noticed a lot of people in the machine learning world really under-appreciate the complexity of dealing with constraints and finding opportunities and disaggregating value chains, or they'll do the opposite: they'll just assume it's so hard that it's impossible, without realizing there are large groups of people around the world who spend their lives studying these questions and finding solutions to them. So in general I'd love to see better cross-disciplinary teams, and more people on the MBA side developing AI skills, and more people on the AI side developing an understanding of business and teams and all that. I guess you have this broad view from your background, and you've watched these ML projects get deployed and become useful, so maybe the question is more like: were there points that surprised you with their level of difficulty to move through? Did you have mishaps where you thought the model was working, and then when it was deployed into production it didn't work as well as you were hoping or thought it would? No, not at all. And I know that sounds weird, but it's just that even a small amount of background in doing the actual work that the thing you're building is meant to be integrating with helps. I spent eight years working on an insurance pricing business entirely based on operations research and machine learning, but before that, the last four or five years of my management consulting career were nearly entirely in insurance. So there's not much very surprising that happens. I know the people, I know the processes. And that's why I think I would much rather see, I don't know, if somebody's going to do a paralegal AI business, I'd much rather see a paralegal do it than an AI person, or if they're going to do an HR recruiting AI business, I'd much rather see someone with an HR recruiting background do it. It's super difficult; there's just no way to understand an industry really well without doing that industry for a few years. And because I know some of these people and I get this question all the time, I'll channel a question that I'm sure is in people's heads watching this: if you are that paralegal who's starting a paralegal AI-enabled business, how would you do the AI part? Well, obviously I would take the fast.ai courses. I mean, seriously, I would make sure I was good at coding, I'd spend a year working on coding, and the fast.ai courses are absolutely designed for you. And I would be careful of bringing on a so-called AI expert until you've had a go at doing it all yourself, because I've found most people in that situation, for obvious reasons, feel pretty
intimidated by the AI world, a bit humbled by it, a little bit overwhelmed by it, and they'll bring on a self-described expert. They have no ability to judge the expertise of that person, so they end up bringing in somebody who's just good at projecting confidence, which is probably negatively correlated with actual effectiveness. So yeah, do it yourself for a year, build the best stuff you can. I do find a lot of fast.ai alumni with domain-expert backgrounds are shocked when they then get involved in the world of AI experts and find they're much better at training models that actually predict things correctly than the modeling experts are. I'm sure you've had that experience as somebody who, like me, doesn't have a technical background in this area. Yeah. Well, thank you so much, this was super fun and educational for me. Thank you very much for having me. My pleasure.",8948 +Anantha Kancherla — Building Level 5 Autonomous Vehicles,https://www.youtube.com/watch?v=HT5UcHnAzU8,2671,2020-08-12,"As you ramp up and you grow that much, you'll start becoming cognizant of your costs. Especially if you're doing it on the cloud, they provide a lot of sharp knives. And as you play with them, you can cut yourself and put yourself in debt, in tons of money. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. Anantha Kancherla is VP of Engineering at Lyft, where he heads up the Level 5 software team working on building self-driving cars. Prior to Lyft he spent time at Dropbox building products that helped teams collaborate, and before that he worked at Facebook building mobile software at scale, delivering core experiences like News Feed on mobile phones. I'm super excited to talk to him. I was thinking it's kind of cool. I assume that the goal is to make Level 5 automation then? Yes, Level 5 automation is the aspiration, but I'll take Level 4. I was actually wondering... I don't think I've ever been part of a team with such a huge, singular technical ambition. I was wondering how you break that problem down into constituent parts, like how you think about what the weekly KPIs should be when you have this gigantic goal. Honestly, this is how I've always worked and I don't know why. I guess because I started my career at Windows, and by the time I left it was well into tens of thousands of people working there, all working towards one product, with a singular focus on one product. Right. This is obviously way smaller than what Windows was, but it's the same idea. It's a good question, you know? What does it mean and how does it break down? So usually in a project like this, you're going to have so many different skillsets and ML is just one of them. Usually when people think about self-driving cars, most think about the AI part of it, but the way we think about it is that first of all, you have to think about the work that we do in two parts: the work that happens in the cloud and the work that happens in the car - the code that runs in the car and the code that runs in the cloud. For the code that runs in the car, you can think about it, if you really want to simplify it, in two parts or maybe three parts. At the lowest level, you have the operating system.
Again, if you want to compare it with a traditional development environment, like imagine you have your OS that you're targeting and then on top of that OS, there will be a runtime that you're going to write - if it's Android, Java or whatever. There will be an equivalent runtime that you want. And then on top of that runtime, you'll have your applications. That's how you would write a typical one. It's a similar idea here. You'd have your OS, that is running on the card hardware. Now, the card hardware is way more complicated than anything that you've seen on the phones or PCs. We say it's like a data center on wheels. Typically a car has a lot of computers in it, some of them we write software for, some of them come with the car and they're all in a gigantic network. Most of the code that we write runs on what we call High-Performance Compute but again, different companies root in different things so they may all factor the workload in different ways. You can imagine multiple smaller computers, or one large computer and one small computer... there are different configurations possible and you just haven't figured out how you're going to break down your workload. And then on each of those computers, you're going to run, depending on how big it is, a fairly beefy operating system or possibly even no operating system if it is a microcontroller. And sometimes, there'll be embedded processors and various microcontrollers and then on top of that, we have a framework that we build that basically enables the software components that are running on that one computer to work with each other. A good equivalent of that is in the open source world, you would have run into something called ROS. It's very similar to that but you can imagine they can also communicate across the computers on the network and then on top of that, you write the functionality that actually makes it autonomous. But then you can imagine that's just one but it could also have a calibration functionality that require a whole bunch of other little pieces of functionality that you would write. But autonomous itself kind of breaks down into your classical robotics paradigm: Sense, Plan, Act - Sensing basically is what we in our world call it, Perception. You have a block of code that predicts how the world is going to change and there's a block of code that basically figures out where I am within the world. It's called localization. Then there's another block of code; once it knows, ""This is what the world looks like. This is where I am at. And this is how the world is going to change in the next few seconds. How am I going to act? What's the plan?"" And then it sends it down to the actuators which is the control part of it. Those are all the kinds of components that work on the car. And now there's a lot of code that actually runs on the cloud, for developmental reasons as well as even during deployment, like you do. There are teams that actually build infrastructure, because even though we are part of Lyft and Lyft is very much a cloud company, the kind of workloads that the ride-sharing part of apps run is very different from the kind of workload that we run. Right. The amount of data that we collect or the amount of compute that we need is on a very, very different scale and the requirements are to be very different. So we have teams that actually think about the data part of that infrastructure, teams that think about the compute part of the infrastructure and then we also have to think about testing all of these. 
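The sense-plan-act decomposition just described can be sketched as a simple loop. This is a hypothetical illustration of the structure with generic placeholder components, not Lyft's actual stack.

```python
# Hypothetical sketch of the sense -> plan -> act loop described above.
# Each stage is a stub; real AV stacks are far more involved.
from dataclasses import dataclass

@dataclass
class WorldState:
    objects: list       # what perception saw
    ego_pose: tuple     # where localization thinks the car is
    predictions: list   # how the world is expected to change

def perceive(sensor_data):        return ['pedestrian', 'car']              # perception
def localize(sensor_data, map_):  return (10.0, 2.5, 0.1)                   # localization
def predict(objects):             return [f'{o} keeps moving' for o in objects]  # prediction
def plan(state):                  return {'steer': 0.0, 'accel': 0.5}       # planning
def actuate(command):             pass                                      # control

def control_loop(sensor_stream, map_):
    for sensor_data in sensor_stream:
        objects = perceive(sensor_data)
        state = WorldState(objects, localize(sensor_data, map_), predict(objects))
        actuate(plan(state))

control_loop([{'frame': 0}], map_=None)   # toy single-frame run
```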
Testing: obviously with unit tests you will do whatever with your code, then there's also the other side of the testing, which is on the road - you build everything and deploy it; but then there's a whole lot of other testing that needs to happen in between. Simulation is one example, where you try to run the software that you built, that will eventually run your car, somewhere in the cloud. And then we also have rigs that we build; we call them test builds, but another term that you will often hear is hardware-in-the-loop testing. So you build depending on which team it is. Every team will build their own smaller versions of hardware, and then there'll be full system tests. So there are different types of these test builds that we build, and you can think of them as mini data centers that we have. You run the code on those as well. We treat them as if they're like another cloud. So we have teams that do all of that, and then there are teams that work on simulation. All of these teams eventually come together at the end of the day. It all gets packaged up into software that is part of the code in the car, and then we test the car on the road, and then there are metrics used to drive work on the software that runs on the car, but quite often it can also impact stuff that happens in the cloud. For example, if you change your sensor and you capture a lot more data per hour, that means you may have to potentially replan your storage capacity. Things like that. Did I answer your question? There are so many more questions, actually. I think you're at the point where you probably have some metric like how long you can drive without intervention that you're trying to optimize, or something like that. But then do you break it down by teams, like, ""we need to make our perception 10% better"", or something like that? How do you think about that? It's kind of broken down, right? For example, there's a top-level metric for the whole system performance and then... Is it like time between interventions? Is that the right metric to look at? For California DMV reporting, that's what they look at. So they look at what's called MPI, Miles Per Intervention: how many miles do you drive before you have an intervention? It's a really common metric that people track, but then there are so many other metrics that you have to think about. End-to-end latency is one example: how long does it take from, say, the time your camera captured a frame to the time that you reacted to it? So there are a number of other metrics that matter, but of course, you can argue that all of them come down to an intervention, as in some human had to intervene. That's kind of what the industry has generally standardized around today but, you know, it's really controversial, because what is an intervention? How do you report it? It's up for debate. But then internally we track a number of other, broader system-level metrics, and you can do two things. One is you can apportion. Let's say you do MPI, the Miles Per Intervention; you could apportion MPI to different components and say, ""hey, the reason the intervention happened was because we misperceived, the reason the intervention happened was because our map was wrong."" And you can apportion those as far down as you can go. But really, that's just only part of the problem. So each component will also have its own separate metric. Perception, for example, we may want to track the precision and recall of seeing different agents.
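As a concrete illustration of the MPI metric and the "apportioning" of interventions to components mentioned above, here is a small sketch; all of the numbers and causes are made up.

```python
# Sketch: miles per intervention (MPI) and apportioning interventions
# to the component blamed for them. All data is made up.
from collections import Counter

drive_logs = [
    {'miles': 120.0, 'interventions': [{'cause': 'perception'}]},
    {'miles': 300.0, 'interventions': []},
    {'miles': 80.0,  'interventions': [{'cause': 'map'}, {'cause': 'perception'}]},
]

total_miles = sum(log['miles'] for log in drive_logs)
all_interventions = [i for log in drive_logs for i in log['interventions']]

mpi = total_miles / len(all_interventions) if all_interventions else float('inf')
print(f'MPI: {mpi:.1f} miles per intervention')

# Apportion interventions by cause, e.g. misperception vs. a wrong map.
print(Counter(i['cause'] for i in all_interventions))
```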
What's the precision and recall for seeing a pedestrian or a car or a bus? And then you can follow something like that: how good is my precision/recall at 50 meters? 100 meters? 500 meters? So there are lots and lots of metrics that eventually break down at the component level, where it comes down to every component. How do you allocate your resources? Is the perception team a lot bigger than the planning team? That's a very good question. I'd say it's roughly similar between the two teams. I don't think there's a perfect science when you want to allocate resources. You kind of have to look at the stage of the project you are in... Because sometimes each part, each arc of the stack, will move at a different pace depending upon what they're building. So let's say you're doing something which is highly machine-learning-dependent, and I'm talking about when you're starting from scratch; steady state, of course, is very different. When you're starting from scratch, when you're beginning, you first may have to spend quite a bit of time building your machine learning infrastructure, data gathering, all of that. So those teams you probably want to populate first before you start throwing models at it. Maybe you can get away with a relatively rudimentary team, a smaller team of just a few core experts on the perception part. And then once that is ready, you start putting more people in that area, and maybe you don't have to work so hard to throw additional people at the infrastructure side of things. Once you start unlocking the ability to see the world, then you can start doing more and more complicated maneuvers and planning, and then you start pushing more into the planning world; and then you start hitting bottlenecks on that side, and then you will say, ""oh, yeah, I should look at adding a few more people to unlock this thing in that area."" It is very, very dynamic, so I wouldn't say there's one standard formula through which we do resource allocation here. I see. But is it like where you're seeing the most interventions actually being caused, or is it where you see the most opportunity for improvement? I think interventions in a steady-state world... let's, for a moment, set interventions aside. I mean, if you just replace interventions with bugs, it's the same problem in any software: where do you have the biggest bugs? And sometimes throwing more people at the problem is the right answer, and sometimes it is not the right answer; in fact, it's the wrong answer. So you may want to figure out putting the right people onto it. Maybe you don't have the right expertise. So it's not always clear that the resource allocation is directly proportional to the number of bugs that you have. I see. That makes sense. At this moment, is there a particular part of the chain that feels the most challenging for you, that feels like there's the most room to improve? I think that if you look at the state of understanding and the state of research in this space, the place where there is a lot of scope for improvement is in the area of prediction and, say, behavior planning. That's an area where there's still a lot of active development going on. The industry is changing fast. Just recently, I saw a really cool paper from Waymo's research team. So there's a lot of activity going on in that work, and I would say that's an area which is developing quite a bit. So this is like predicting where another car is going to go, or whether a pedestrian...?
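A small sketch of what tracking precision and recall by distance bucket, as described above, could look like. The detections, ground truth, and matching rule (simple label counting within a bucket) are all simplifications made up for illustration; real evaluation matches boxes geometrically.

```python
# Sketch: precision/recall for detecting agents, bucketed by range.
from collections import defaultdict

detections =   [('pedestrian', 40), ('car', 90), ('car', 480)]   # (label, range in meters)
ground_truth = [('pedestrian', 42), ('car', 95), ('bus', 470)]

buckets = [50, 100, 500]
def bucket(r): return next(b for b in buckets if r <= b)

def by_bucket(items):
    d = defaultdict(list)
    for label, r in items: d[bucket(r)].append(label)
    return d

det_b, gt_b = by_bucket(detections), by_bucket(ground_truth)
for b in buckets:
    dets, gts = det_b.get(b, []), gt_b.get(b, [])
    tp = sum(min(dets.count(l), gts.count(l)) for l in set(gts))
    precision = tp / len(dets) if dets else 0.0
    recall = tp / len(gts) if gts else 0.0
    print(f'<= {b} m: precision={precision:.2f} recall={recall:.2f}')
```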
What will happen in the world in the next few seconds? Who should I pay attention to? What should I watch out for? You know, all the things that we as humans take for granted. Those kind of problems. So inside of the car when it's when it's operating right now, how many different models are running approximately? Oh, boy. I'm not sure I can tell you the actual number, but in terms of ML, we have so many different ways of deploying these models, right? On the car, you deploy them and then in the cloud, you have a couple of different ways of deploying them. And if you look at Lyft at large, including ride shar, there's so many different ways. Like there are times when somethings that look like just run their models on the desktop once in a while pretty ad hoc there are online loops and there are online learning that is happening, they're all running in the cloud. Then there are models that are running on the phone, and models that are running on the car. So we have models pretty much everywhere. So I guess you're responsible for the ones that ran in the car and the cloud for now. We also help the ride share team as well. So there's a few people on our team who help out, because we have a lot of pretty amazing machine learning people so we also help the core part of it. We do have visibility into like how they do it also. In fact, sometimes our teams work on the cell phone models or like the offline models. It's cool to talk to someone that's working on so many models at the same time. I'm really curious about your infrastructure for all these. How often are these models updated? It completely depends on which one you're talking about. So if you're doing the mapping ones that's really dependent on why you're using the model. So sometimes, these models are used to help the operators as they work on the new UI techniques or whatever where there's some additional assist they're providing. They made update it when that time comes. Otherwise, generally they're assisting humans as opposed to doing it on their own. So those models don't update us frequently. But the models that are operating on the card, you do that depending on what you're addressing and which area that you're trying to improve. So let's say you're working in winter and you see a lot of vapour or smoke much more visible, there'll be some parts of the code that are more impacted by all of that. So you'll see those iterating very fast. In general, though, these models tend to get trained and iterated upon on almost a continuous basis like the ones which go on the car. Did these models feed into each other? I remember when I was following models, there was a big version problem of one changes in the downstream ones need to update, do you actually then retrain everything downstream from a model if you change like an upstream model? How do you keep track of that? The models do feed into each other. That does happen. This is where I would say that since I'm not day-to-day involved in this work, I don't know the specific details about how the team manages it, but I don't think that they have to go and retrain their downstream models. But maybe you can think about it as you have your model metrics, right? So you train your model. You get a bunch of metrics around the model. But that's not enough. So you have to look at downstream metrics also. Because quite often, those tend to be the trickiest bugs also. So you bring your model, it all looks good in terms of the metrics and it's all working fine. 
You deploy it in the car and then you see the behaviors change quite a bit. The car may decide to brake more often, or do something different. And then you have to debug that, because the model behavior has impacted something downstream. So it's not necessarily that you have to retrain those downstream models; you may just want to figure out where the interaction is happening. There are a few things you do have to be very careful about, though. Like the validation set that you'd use to validate this model and the training set that you'd use for training downstream: you have to be very careful about keeping them all separated, and hopefully there's no overlap. Otherwise, you may introduce some artifacts. So those things the team has to be very careful about. Do you think it's harder with these kinds of models? A lot of people have talked about how predicting timelines is much more difficult. Have you found that to be the case or..? Predicting timelines? Timelines of improvements. I feel like software's already hard, right? But with the models, it almost seems like it might be unknowable how you get X% improvement. Do you give your team goals where you say, ""look, I want to see a 10% improvement on this accuracy metric""? They set themselves goals for improving it. I think ultimately it's the same with any software. You set yourself a goal. But just because it's machine learning doesn't mean that it's a new problem. The problem has got to do with the fact that you really don't know the perfect solution and you can't really estimate what it'll take for you to get to the perfect solution. So the way you do that is by a series of experiments, by iteration. Because if you knew exactly what to write, then why wouldn't you pretty accurately estimate the time? And sometimes that's the case, like say you say, ""oh, I need to go refactor this thing."" You know roughly how long it would take, and you have test cases around it, you can test it, so you know all your unknown unknowns, all of those things that trip people up. So you become more and more predictable over time. That's basically what I'm trying to say. Basically, what happens is that as you keep working on the problem, you start having a better idea about how long it will take, because you start developing intuition about that particular area. Then you probably have unit tests, something like integration tests, and some other tests that help guide you, focus on the right areas, and carve out the noise. Then you tend to get a lot more predictable in the work that you do. Then after that, if you change your model, you come in with a new model, you know how many experiments you need to run. You know how to scale. Then at a point, it's like throwing money at the problem: you parallelize it and you do a lot more work. But you are getting more and more predictable just because you've built all that intuition and all that collateral through tests. So I would say that, going back to your question, are they predictable? Definitely not, I would say. But as they start working on it, they get better and better and more accurate about how much they can do. And this isn't predicting improvements, like incremental improvements. I would say it's more like, let's say they're trying to fix issues, because they tend to get more and more predictable about that.
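Since keeping the upstream model's validation set separate from the downstream model's training data is called out above as something to be careful about, here is a tiny sketch of an overlap check. The frame IDs are made up.

```python
# Sketch: guard against leakage between an upstream model's validation set
# and a downstream model's training set. IDs are made up.
upstream_valid_ids = {'frame_0012', 'frame_0458', 'frame_0991'}
downstream_train_ids = {'frame_0458', 'frame_1033', 'frame_2210'}

overlap = upstream_valid_ids & downstream_train_ids
if overlap:
    print(f'potential leakage: shared frames {sorted(overlap)}')
else:
    print('no overlap between upstream validation and downstream training data')
```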
Now, they're bringing in they say like, ""oh, the circumstances with the goal of saying I'm going to improve it by X% improvement."" So the best way they can do that is by running a whole bunch of experiments and see how fast they can come. Even that, if your infrastructure is better and you have a good set of personnel that you can get incrementally better but I don't think that it's any different than any other software development that you can get super predictable. I guess one question or one thing that some people say is different, or that I imagine is different is testing these models. Before you put them into production, do you test them against a set of unit tests where it's, ""I insist that the model does this or that the car does this in this situation?"" Or is it more like an overall accuracy of I want it to make the right decision 99% of the time. How do you think about that? Because aren't these models somewhat inherently unpredictable or they're not always going to do exactly the same thing, right? Right. So the way it works is that you have a model and you will have a certain... I'm talking in terms of Perception models because if you're doing something else downstream in the area of planning, it's pretty different. So you will have certain metrics that you reached today. Then what you will do is obviously you're working to go beyond that metric, right? You can identify that as part of your model development. Like you'll develop it, you'll have the model results. But that's not enough. So you then have to do some level of integration testing. You put it all together and then you see, like, ""how's the downstream metric?"" Let's say if it's perception. The output of perception that really planning would consume is what we call Tracks. These are basically objects over time that you track over time. So you have to get those tracking metrics improved or better or impacted in one way or another in those areas. And then when you put that in the car, then the top level metrics that you have, like, how's the car behaving, whether it is driving, is it comfortable? Is it safe? So what are the metrics that you track for any of those things? So you have to get that right. So you have to go through this entire journey repeatedly. It's not you just run the model once and it works. And are you able to run these tests every time there's a new model, as you try to pass the first test and then sort of expand out? Yeah. You have to run through the entire gamut if you're doing something brand new. Do you have any interesting examples of something that improved the local tests but made that the test worse? Yeah, yeah. There's a bunch of examples. The thing is, I don't know what I can tell you. Fair enough. I mean, there are all these cases where you'll see the interaction between the model you see upstream, the perception and what happens on the planning side. I mean, I can tell you as a friend but maybe I shouldn't put it out there on the clip. But anyways, I was just giving you an example of winter; you have a lot more smoke that you have to deal with. We could see that the model performance is really good but when we integrated it, it didn't work right. And so we had to go back and see if there was some interaction going on between the upstream model and the downstream model that caused this problem. These kind of things happen all the time. So the team over time has become much more rigorous about all these things. So any time they do this, there's a lot of automation built in. 
They test all these things. So they have to go through the whole thing. I see. When you see teams improve models, is it typically that they've collected more different data, changed the data pipeline or improved the model itself? Do you have a sense of what...? Most often you feel like the improvements happen with the right data. Interesting, the right data. It's less often that the model architecture itself has to be changed. Got it. Yeah. But is it that it's the MLT themselves that's asking for different types of datasets? Do they control that process, or is there a data team? OK, so this is another big thing that we've been somewhat religious about it, level 5. We don't have a notion of an engineering team, a science team or a data team. We just have a perception team. I'll tell you my mental model around ML. I think ML is a skill. It's like anything you learn.. you know how to write python, great. Or you know how to write C++, that's a skill. So ML is a skill but a skill alone is not enough. So you need domain expertise. So just because you know how to write Python may be good enough for some things, but if you're trying to build some complex insurance thing, you probably need to understand insurance. So how do you divide the domain knowledge from the skill? In some cases you can't. And you see that happening like so you'll have EMs write a spec and then they'll give you something and say, go and implement it. More often you'll see that the engineer has to really ramp up and truly understand what the actual problem is because they would debug something and they have to really understand what is going on. It's the same thing in ML. So we have a team that is called the prediction team, their job is to predict. So we don't have a difference between some data scientist or a data team and an engineer. So it's the same people who have the domain expertise and have ML skills. And that's how we've been operating so far. That's cool. So all of your teams have a mix of skillsets. Yeah. So this seems to be a pretty big debate in the industry, you know, should we have a science and an engineering team idea. So the mental model I've come up with is Job of our Science; is to develop knowledge, the art of what they produce, the production is knowledge and the job of an engineering team is an artifact. In most cases, we are actually building an artifact. We are building a product. So in each case I see that the science vs engineering divide to be less germaine in these areas. You can have a research team. Their job is to produce knowledge. And that's OK. But when it comes to developing a product, I've always found that it is better to have the domain knowledge and the skill people together. And I believe if you can find the unicorns in your fold, that's awesome. But we have very few of those, but then we kinda have to bracket them with people with the right skills. Does it make sense now? Totally. I mean, we see the same thing with a lot of the companies we work with and if I was in charge, I think I would lean in the same way as you've, make sure that the people doing ML are right inside the teams that are actually trying to accomplish something right now. And now coming back to the data question that you asked.. Yeah. So if you are a domain specialist, you already have a very good intuition about what is the data that you want. And like we just said, most of the problems seem to be about finding the right data, then the right model. 
So you have this nice property where the team just knows what the right data to seek is. Got it. That makes sense. Is it challenging for you to deploy models into production? I've never had to deploy into hardware; is that a challenging step for you? Do you find sometimes the model doesn't perform the same way you expected when it's actually inside the hardware? I think there's a difference between when you're building it for, say, a cloud service versus what we are doing here. For the model, there may be a transform after training, right? Generally, the transform after training could be quantization or something like that. So you have to develop a good understanding of that, of the impact of that on the model. The other thing that becomes really important when you're deploying, and this is no different for mobile apps, when you're deploying models on your mobile phone, is that you are really, really careful about power and latency. So you have to be really rigorous about your op count: how much time does it take? All of that you have to think about. Do you build infrastructure to actually monitor these models as they run in production? Yes. So actually we have an internal framework. In fact, I was just watching a video just before this; they were doing a demo because they'd just built a new one. And that takes care of all of this for you. It would track all of that, like when you are building and training your model and running your experiments. And in fact, we dump all of that, probably, into your tool. Cool. Some people talk to us about worrying about feature drift. Would you notice stuff like a sensor broke or something, or if the model's getting a different kind of data? Is that something you look for, or is it mainly just the latency of the model? Oh, I see. So you're talking about the model showing some strange behavior... Yeah, some weird situation where it seems to be struggling. So it could be so many problems. Yeah. It could be that some sensor has gone bad. I was just thinking of something specific here. Yeah, those things happen. But the way we find out a bunch of these things is that increasingly we depend on something we call unsupervised metrics. Like, what's the rough size of a bicycle? Yeah, like a meter or two meters. I mean, probably a lot more than that, say three or four. But if you see a 50-meter bicycle, then there's probably something wrong with that, right? That's a very extreme example, but you can imagine that there are a lot of such heuristics that people put together and track. And if you start seeing weird things happening, that enables you to catch lots of crazy bugs, and sometimes that's a really good way of catching long-tail issues as well, because it may not result in a disengagement, but you may see some weird behavior, or it may trigger a disengagement but not often enough that it's an important problem to focus on. So increasingly we depend on all these unsupervised metrics: as the data comes in, you compute all these various interesting statistics and you figure out what is actually going on. And then you go back and resolve it. It's funny, I was just looking at Jira tickets for my company... if you see one thing wrong, does that warrant a ticket? Like if you see one bicycle that's too big, will you actually file a ticket against that? You should. I mean, all of these things should be like filing a ticket.
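A sketch of the kind of heuristic "unsupervised metric" described here, flagging detections whose size falls outside a plausible range. The size bounds and detections are made up for illustration.

```python
# Sketch: rule-based sanity checks over perception output.
# Size bounds (in meters) and detections are made up.
SIZE_BOUNDS = {'bicycle': (1.0, 4.0), 'car': (2.5, 7.0), 'bus': (6.0, 18.0)}

detections = [
    {'label': 'bicycle', 'length_m': 1.8},
    {'label': 'bicycle', 'length_m': 50.0},   # clearly wrong -> flag it
    {'label': 'car', 'length_m': 4.6},
]

def flag_anomalies(dets):
    flagged = []
    for d in dets:
        lo, hi = SIZE_BOUNDS.get(d['label'], (0.0, float('inf')))
        if not (lo <= d['length_m'] <= hi):
            flagged.append(d)
    return flagged

for d in flag_anomalies(detections):
    label, length = d['label'], d['length_m']
    print(f'file a ticket: {label} with implausible length {length} m')
```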
If it is that glaringly obvious, and you have the time to do it, to take a look at it, you should. So give us only one example where it seems wrong and we're gonna take a look at it. Yeah. Again, this has nothing to do with to self-driving cars. I mean, we used to have similar problems in windows. You know, like there'll be some weird one off thing that you saw and we would record it. And then next thing you know, if it's some old changes that suddenly things starts to pop up and then you're like, ""oh yeah, I've seen it in these environments and situations."" And then you kind of are like ...So it's important to anytime you see something anonymous, you just find it. And hopefully, you have more context that you capture and then it'll help you debug. Do you have a team that's tasked with looking for that or is that kind of everybody's looking for those things? So we have a team that...obviously a lot of our reports come from the drivers driving the road. But then we also have to have additional people to go back and look at the data and see if there's something weird going on. They're not necessarily engineers, we call them Operations. So they scan and take a look at these things. And, of course, engineers also you know, look into these interesting cases. And then we actually look at it as well. But there's so much data coming in that, ""which one do you look at?"" and ""how do you prioritize?"" That really becomes a more interesting problem. It sounds incredibly challenging. Yeah, yeah, yeah. I mean, I believe that this is a problem of any software data scheme. So in our case, I mean, I've heard again that there's a lot to work on, major products that operated at scale. And it's the same problem, whether you're running newsfeed at Facebook you're running some issues in Windows or you're running a car on the road for thousands of miles. You get lots and lots of reports and that's the issue of diversity. Yeah. I guess these are the sort of issues for complexity and scale. Yeah. It's a complexity and scale problem. These are extremely simple problems. You have a sanitized test track and you're running your car and it's probably you can be very selective about what you do and be really rigorous. But when you're running it on the road, anything can happen. And it's like you run your operating system on any kind of PC anybody will tell you you still haven't figured out what's happening. So you've been at Lyft for almost three years now? Yeah. And the organization must have grown quite a bit in that time. I'm curious about how process have just changed as the organization has grown and things have solidified. A lot, right? It's interesting. There are teams which were nonexistent and then they've been built and then now they've gotten to operating at scale. I would say the current Perception team is one of those which is now operating at scale. But then there are still some new teams that are forming and it almost feels like they're doing things which were, say, a perception team or some other team was doing it at the better beginning. Of course, they have a lot more guidance now because there are other teams that are out looking for them and they get through it. A few things happened. One is as you get bigger and operate on a wider scale... In the beginning, we would not care about where the training was happening. The engineer would train it on their desktop; the workstation that they had, it could be like that. 
As the team started growing, more engineers started coming in, and reproducibility and all of that started becoming a real problem because multiple people are working on the same thing. And so then you start becoming much more rigorous in your process, and that's fine. But it will work only if you have maybe four people working together. Once you go beyond that, process won't fix it, mutual agreements won't fix it. So you probably should be building a framework to help you standardize that process and just make you not worry about all the moving parts. Then after that, you'll find that the framework doesn't last, and then you'll write a new framework for the new scale of problems that you'll run into. And we've gone through all of that. Then another thing that will happen is, as you ramp up and you grow that much, you'll start becoming cognizant of your cost. Especially if you're doing it on the cloud — they provide a lot of sharp knives, and as you play with them, you can cut yourself and run up a bill of tons of money. And so again, you start becoming very, very careful and you try to build your ML frameworks, or whatever — maybe it's not just ML, but even simulation — you start building your frameworks to help keep that in check. Then you start becoming very, very rigorous about your data partitioning. And you also have versioning, and you track all of that in a tool. And you probably want a custom tool, which is kind of how we ended up building our own analytics tool internally to track a number of these things. And then you start wanting to track all your experiments even more rigorously, and then you begin to use Weights and Biases. And then you start running into time problems. You know, you do more and more complicated models, and then you want to get done with your experiments faster, and then you start doing distributed training. And, you know, you've gone through this entire journey... I'm sure there's more. Yeah. What's next? I'm sure there's more. I'm sure if I talk to somebody at Google, they will probably have gone even further down this order of things to be done. Thanks. That was well said. We always close with these two questions. I'm curious how you're going to answer these. You can be very specific, or maybe you can expand it to autonomous vehicles. What's one underrated aspect of machine learning or AVs that you think people should pay more attention to than they do? I've noticed that there's a tendency to think of it as just a skill — like you throw data at it and it gets better, and you can do that over and over again and then you may be able to get a good result. But I always go back to the idea that one very underrated aspect of machine learning is that it has to be coupled with domain knowledge. You really have to have a good understanding of what problem you're solving and a good understanding of the domain. In fact, I would say spend quite a bit of time really understanding the data that you're going to get. And then — because, as I said, the right data is more important than a lot of data — actually, there was this interesting case where we made some change and we cut down our data usage by half; it became way cheaper for us and our model became more accurate. So that's what you get by actually genuinely understanding the domain. And I won't go on too much more about domain knowledge, but I think that's something that I would say is very important in this area. Especially with machine learning.
But you could argue it's true for anything. Yeah. Yeah. All right. Well said. Today's last question is: you've actually deployed several serious machine learning products at scale — what's been the biggest challenge in taking them from the experimental stage to actually running in production? So one of the biggest problems with machine learning is getting it to generalize, right. There are a lot of tail events, and the data for those is typically really sparse in your dataset, and trying to figure out why something happened, what happened, is generally really difficult. So this is one area where I think you have to figure out how to combine machine learning with other techniques — you know, in a place where you want to have absolute guarantees, in a system like a robot where there is actually no human intervention. In other areas, I think there are some very nice properties if you're doing a human assist. So ML has a really nice property of being able to do good enough. Let's say you get to 99-point-something — some number of nines — and then the remaining part can be augmented by human intelligence. But if you really are trying to build a perfect robot, then it has to be completely autonomous. You have to figure out additional ways in which you can have some guarantees. And that's actually quite challenging. Figuring out everything — it seems challenging. Yeah. Awesome. Thank you so much. It was a real pleasure to talk to you.",7115 +Bharath Ramsundar — Deep Learning for Molecules and Medicine Discovery,https://www.youtube.com/watch?v=GnkpVjp117k,3311,2020-08-05,"For a lot of these things, it's actually really easy to make something poisonous. And as governments and the industry have grown in recognition of this fact, you just have this recurring thing where all of a sudden you invent a miracle something-or-other — oh, plastics. Plastics were thought to be the wave of the future in the 1950s. They're also just a type of molecular product. And now we find out that they choke seagulls, they choke baby turtles. There are microplastics everywhere. I think this is a type of generalized toxicity issue that we realize when you make large quantities of a new substance that the world broadly isn't prepared to digest: what happens is 30 years down the line, you're like, oh, crap, I killed off the trout. I killed off the eagles. So it all comes down to the fact that I think, you know, living systems are extraordinarily complicated, and making something that is tested and safe for a living thing to interact with is actually very challenging. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. I'm especially excited to talk to Bharath because he created the DeepChem open source project, which we've seen a lot of our customers at Weights and Biases using, and it seems to be the most popular library for people working on deep learning applied to chemistry and biology. He also made an open-source dataset called MoleculeNet, which is a benchmark suite to facilitate the development of molecular algorithms. He got his Ph.D. in computer science from Stanford, where he studied deep learning applied to drug discovery, and is the lead author of TensorFlow for Deep Learning: From Linear Regression to Reinforcement Learning. I'm really excited to talk to him. It's really exciting to talk to you.
We've been seeing a lot of customers come in doing drug discovery and other medical applications, and it's something that I'm not super familiar with but seems incredibly meaningful. We've got a chance to talk to a whole bunch of our customers and ask them what they're doing. And one thing that keeps coming up is actually the DeepChem library that I think you're the original author of. So I really wanted to start off by asking you about that. What inspired you to make it, and what problems were you trying to solve? Yeah, absolutely. First of all, thank you for having me on the show. I'm glad and excited to chat as well. Lots of folks I know have been using Weights and Biases to train models and track experiments so I think it should be a fun conversation, I hope. A few years ago, basically during my PhD, I did an internship at Google where I used their mini-computers to train some deep learning models for molecular design broadly. But I think what happened was, as with all good things, the internship came to an end, and I had to head back to Stanford and then I found out I no longer had access to all that code. I couldn't really reproduce my results. So I think the starting point was I just wanted to reproduce the results of my own paper. And I think to start basically was just a few scripts in Theano and Keras at that point. Then I put it up on GitHub, I mean why not? Then a few more people did start to use it, then it just sort of grew slowly and steadily from there. I think the original aim of DeepChem was really to enable answering questions about so-called small molecules. So most of the drugs that we take, Tylenol, your Ibuprofens, things like that are all small molecules. But over time, I think pharma has actually begun to shift off it and so now there is newer classes of medicines. There are of course things like vaccines. So nowadays, I think DeepChem is slowly trying to grow out to enable open-source medicine discovery across a broader swath of modern biotech. So that's just a little bit about the project. I think there is a very active community of users. There's a number of educational materials and tutorials built up around it. I think it's also that a lot of medicine discovery is quite proprietary. There is biotech things that we often see their advertising material like in our proprietary algorithm, our proprietary technique, which has worked fine for the industry for a long time. You know, that's the way most medicine we know was discovered. But, of course, as we know in tech there's just been a shift, in that open source is increasingly a foundational part of the way we build companies, we discover things. So I think part of the goal of DeepChem is to bring some of this open-source energy to the biotech drug discovery community and enable more people to be able to share in these tools. It seems like you've definitely been successful at that. I mean even before I knew of you, talking to folks at Genentech and GSK and I would say, over half of the conversations I've had with pharma companies have mentioned DeepChem, I thought it was pretty cool that they are using the same platform and contributing IP. I didn't know that pharma did that at all. So that seems really wonderful. I think it definitely is kind of a new shift in thinking. But of course, you know, pharma has seen the fact that TensorFlow is open source, PyTorch's open source. So I think it is the beginnings of a shift. At the same time, I think IP considerations definitely do matter a lot. 
So I think a lot of folks find they can't contribute at some places, which is fine, I think it's just a policy. But there is still a culture of caution around potentially releasing valuable IP. But I think what helps things a bit is there's this recognition that oftentimes it's the actual data that's the core IP. It's not necessarily the algorithm that's just calculus. And so I think there is some favorable shifts in the industry, but it's definitely something that's only beginning to happen. So just taking a step back, because I think not everyone necessarily knows the field at all. I actually didn't, till maybe six months ago when we started to see our users doing this. What's the canonical problem here that Pharma is trying to solve? Yeah, I think it's a great question. At heart, the goal really is to design medicine for diseases you care about and the reality is this is an extraordinarily complicated process. And I'd say even now, machine learning is only useful for 10% of it. And the task here is that say you identify a disease, then you want to find a hypothesis for what causes the disease. Maybe there is a protein that somehow has become misconfigured or mutate in the body. There can be a whole host of disease-causing factors but you oftentimes try to take a reductionist, you narrow that down to one protein target. So you say that if I somehow could change the behavior of this protein, I could potentially cure this disease. It's a hypothesis. It might be right, It might be wrong, but it's a good starting place. Then you go out and you say, now I know this protein. Can I find a molecule that causes it to have some interaction? So there is a few mental models for this, you can about it as a lock and key, you can think of it as basically an interacting agent that comes in, the drug, that is, and shifts the behavior of the protein the way that's favorable. So the goal computationally at a crude level is that design the molecule, given the description of this problem, print out the ideal molecule for this. Now, the reason this gets challenging is that the ideal molecule is extremely hard. I think one of the hardest problem here is that there's this question of toxicity. I think the silly example for this is if you want to kill cancer cells, you can pour bleach on them. You can't drink that bleach - that's going to kill you too. So a lot of medicine is pretty indistinguishable from poison. It's really targeted poison that goes after one particular part of the body. So when you're designing medicine, you're often just struggling with this challenge, if you're on this very razor-thin design edge of between poison to medicine. You also often don't have a precise model of whether the potential drug works or not until you try it in real patients. So you try to make proxy models for this. Traditionally you'd have something like a rat that has some variant of the disease, or sometimes it's things like cats or even dogs but when you think it's safe, you then try it out on real patients. So this is kind of the clinical trial process: Phase One, which tests toxicity. Is it safe for humans? There's Phase Two that has efficacy. Is this actually showing effect in a group of patients I'm trying this on? And then Phase Three is basically ""OK. 
We think there is an effect; let's make sure on a big trial with lots of people."" And occasionally there are things like Phase Four, which is: after the drug is being used by real people, let's do more studies and understand the real effects it's having on patients so that we can give better guidance to doctors. So I think the heart of the challenge in applying machine learning here is that we are dealing with a lot of unknowns. We don't know precisely why things become poisonous. We know some of the reasons, but oftentimes you'll get these strange factors that crop up. We don't know if a potential medicine actually treats the disease in question until we try it. Just to slow down for a second. I think it's not even obvious to me necessarily what the machine learning problem is within that. What's the input data and what are we trying to predict? That's definitely another great question. And usually, the challenge here is that you start with a very narrow sliver of this problem. So there are, say, limited models for toxicity: given some amount of data, you create a database of compounds and you say this molecule induces a negative effect. You can train a machine learning model that, given the structure of a new molecule, will predict an output, which is the toxicity label. The challenge, of course, is generalization. You know it works on your training set, but if I give you a new molecule, does it actually work? That's often the question, and it's very hard to gauge. And then how is it possible...? Sorry, there are some questions I have. How would you possibly have enough training data? You're not going to keep poisoning cats to keep finding more and more poisonous molecules, right? How does that work? I think that's another great question, and the real answer is we don't have enough training data. Which is why I think molecular machine learning is a bit of an art right now. Unlike images and speech, where there are these dramatically larger training sets, here the datasets are fundamentally limited. There are a few approaches people take to deal with this. I think one common theme is: let's use more of the fact that we know a lot about physics and chemistry. Toxicity, I think, is a very hard problem — it's biology, so it's kind of harder. But in many cases, you'd say ""well, okay, I know something about the molecule. I know something about its invariances. I can encode that into the convolutional network."" So now you have increasingly sophisticated graph convolutional networks that encode more of the known molecular structure. It's definitely not a solved field. I think this entire part of machine learning is far from what I call the ImageNet moment — that point at which the thing just crosses over and breaks out. I think right now it's useful, but it isn't that magic bullet in this area. I actually really would like to go back to that, but I want to make sure I understand the core problem here. So it sounds like you have a molecule and you want to predict some kind of property? I think that is definitely the most common one. There are a number of variants to this. Like you might have a protein, and then you want to find a molecule that interacts with it. One way you can frame this is: does this molecule interact with the protein? There are also generative models, where you say, okay, given a database of known drugs, use an LSTM or something to just print out a new potential drug. This tends to get a little hairy.
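As a rough illustration of the property-prediction setup described above — molecular structure in, toxicity label out — here is a minimal sketch using DeepChem's Tox21 benchmark loader and a graph convolutional classifier. Treat it as a sketch under assumptions: loader signatures and model defaults vary a bit across DeepChem releases, and ten epochs is an arbitrary choice, not a recommendation.

```python
import numpy as np
import deepchem as dc

# Tox21: a few thousand small molecules, each with 12 binary toxicity labels.
# The loader featurizes the SMILES strings into molecular graphs and splits the data.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="GraphConv")

# A graph convolutional classifier with one output per toxicity task.
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="classification")
model.fit(train, nb_epoch=10)

# Generalization is the real question, so score held-out molecules, not the training set.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print("validation ROC-AUC:", model.evaluate(valid, [metric], transformers))
```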
It's kind of hot research, but it's not safe to really use in production. I think there are some raging academic debates about that right now. Alright. Sorry, could I ask some more dumb questions? How do you even represent a molecule? Text seems kind of obvious to me, but I mean, it seems like molecules have a variable length and they have some structure. Is it a graph? It's actually a great question. Thankfully there's the field of cheminformatics, where a number of years ago they defined a thing which is called SMILES, S-M-I-L-E-S. So SMILES strings are basically a language that allows you to write down molecules. It's most often used for small molecules, but you can write pretty big arbitrary molecules as well. Many architectures take the SMILES and convert it into a graph. And the idea is that the atoms in the molecule turn into nodes in the graph and bonds usually turn into edges. Although sometimes you do something like a distance cutoff, because there are these non-covalent interactions. So you might say all atoms that are close to each other are now bonded and have edges in my graph. And does that completely represent a molecule? Honestly, not at all. Real molecules are these very complex quantum beasts that have orbitals and extremely complicated wave functions. In fact, I'd say that when you get past really teensy molecules like helium (there's probably a few slightly more complicated ones), you actually don't know the quantum structure of these things. Until the quantum computers arrive and we can run these simulations, we actually do not really have the ability to grasp the ""true structure"" of a molecule in most cases. So it's an approximation. It's mostly useful for many purposes though. But yeah, molecules are more complicated than we understand, in many cases. So when you talk about an LSTM generating a molecule, it's generating, literally generating, a string that gets interpreted as a molecule? Exactly. So with the SMILES language I mentioned, precisely what you do is you just treat it like a sentence generation task, but you're generating in the SMILES language. And oftentimes the challenge there is that if you do this naively, you'll generate grammatical errors, so it's not an actual molecule. But there's been a lot of research by some groups at MIT in particular and UToronto that have worked out ways to constrain the generative models so they're more likely to generate real molecules. So I guess this sounds, you know, as an ML person, this sounds incredibly appealing, right? Like a kind of well-formed, tricky ML problem that has the potential of saving lives. And I guess I wonder how much of this is real and how much of it is speculative? Can you point to an example of a drug that was created through this process or helped by this process? So, absolutely not, unfortunately. This is kind of where it gets really fuzzy. On average — I think Covid might actually speed up discovery in some cases, but most of the time it's like 15 years from the first discovery, from starting a project, to actually getting to patients. There have been simpler computational techniques in use for decades now, so there is some degree of evidence that they help. But I don't think there's been a smoking gun. There isn't like one molecule they can really point to and say that an AI made that. And I think it's more like, you know, the process of using this program helped, in some fuzzy, hard-to-quantify fashion, the design of this compound.
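To make the SMILES-to-graph idea concrete, here is a minimal sketch using RDKit, a common open-source cheminformatics library (the choice of caffeine as the example molecule is just illustrative): parse a SMILES string, then read atoms off as nodes and covalent bonds off as edges. As noted above, this graph is an approximation — hydrogens, 3D geometry, and the underlying quantum structure are not captured.

```python
from rdkit import Chem

# Caffeine, written as a SMILES string.
smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
mol = Chem.MolFromSmiles(smiles)  # returns None if the string is not grammatical SMILES

# Nodes: one per atom, labeled with its element symbol.
nodes = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]

# Edges: one per bond, labeled with the bond type (single, double, aromatic, ...).
edges = [
    (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
    for bond in mol.GetBonds()
]

print(f"{len(nodes)} atoms, {len(edges)} bonds")
print(nodes[:4])
print(edges[:4])
```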
But it seems like the programs are kind of suggesting, or at least the framing that I hear from a lot of our customers is the programs are like suggesting compounds to try. Which makes a ton of sense, right? Because you have to try something. So I assume that people have some non-random approach for this. It seems that there must be evidence now if these deep learning techniques work better for this kind of suggestion than other techniques. That seems like pretty quantifiable. or am I missing something? So. I think part of the challenge here is that it's hard. There is like many steps in the process. So there is a paper from Google recently where they showed that on one particular task that, when they ran the experiment, they say naively was like a few percent hit rate. That is like things that actually like looked like they might work in that stage. And when they bootstrapped it by training and machine learning model, then make predictions it was something like 30 percent. And, you know, that sounds like a giant boost, but I think that's like one step out of like 20 in the process. So, you take the thing that comes with that and you go to the next stage where you are like, well this molecule's good, but it turns out that it gets caught up by the liver. We need to, like, change it somehow so that it avoids that. And right now, the best way to do that is still to hire a seasoned team of medicinal chemists who can guide you through that process. In the later stages, it gets particularly gnarly because you have very small amounts of data. So like the Google paper, it was at an early stage where they could generate programmatic large datasets, like 50 million data points or something. But in the later stages, you might have like a hundred. And then also you are in that fuzzy no man's world in which machine learning is kind of witchcraft. So I think that's part of the reason. Because maybe you started out with something that was AI-generated but then 10 medicinal chemists came along, tweaked it here tweaked it there, then what do you have at the end? And honestly, we don't know, like, I think 10 years from now, maybe there will be a molecule we can point to. But for now, I think it's so fuzzy. It's kind of interesting that you said I mean, I totally resonate with the ImageNet moment, because I definitely remember the ImageNet moment for vision, because I ran a company that was selling training data and suddenly, you know, everyone flipped from wanting text training data to images because suddenly all the image applications were working. But I guess what was kind of interesting was that I actually feel like the ImageNet moment came a few years after ImageNet, like not only did we see vision starting to work, but it took people a while to realize it. And then companies started to staff up. And now, you know, I can go on Pinterest and click on stuff and buy them right away. Or I can find out my baby photos on my iPhone. But like, it seems like this one, the medical companies have kind of staffed up maybe before it's clear that it's working. Because it does seem like deep learning is now important to basically every Pharma company. I mean, it seems like this could be set up for a real serious disappointment also. I think that's very kind of insightful as an observation and I think you're totally right. 
I think if you talk to a former veteran and they'll talk, there's like this old Fortune magazine from 1980 where they had some pictures of molecules on a computer and they said it's going to be like medicine on the computer, It's going to change everything. And of course, nothing changed. And I think, you know, even for the Human Genome Project, there's a lot of hype. You know, people thought, having access to the genome would change everything. But I think the recurring theme of biology is that billions of years of evolution always have more tricks behind them. So I think you're right. I think deep learning is a useful but not magical tool in the space right now. And I think that in some cases that disappointment has already hit people. I think in other cases still, my hope is that people stick with it because I think these techniques do have a lot to offer. But I don't think it's going to magically cure cancer. I think it'll be one useful tool in the scientist's toolkit to discover medicine. But what do you think caused people to feel this optimism because machine learning techniques have been around for quite a long time. And I presume people were trying these on the same datasets. Like, is there something special about deep learning that it sort of feels more promising in some way? It's a great question. You know, I think, you know, we all saw this amazing wave of just deep learning hype. Because I think that ImageNet moment spread out into these other fields. And I think people started hoping. I think there are some genuinely new advances that deep learning on molecules has engendered. For example, the more predictive models, when you have enough data, they actually start working considerably better. This Google paper I mentioned a while back, it actually gets like a considerable boost over a simpler or random forest or something because it has enough data. The generative models, they can sometimes do clever things. So I think there is some substance, that sort of paper. But there isn't that. I think there is the hope that it might lead to a breakthrough. And just speaking for me personally, when I started working in this field, I didn't really understand any biology or chemistry, I think 9th bio classes, was my last formal training in the subject. You and me both [laughs] Had a good 9th grade bio teacher but yeah I think when you come in, you're like, well, you know, tech can solve many hard problems, like why can't it solve this? Why not? And I think the answer is evolution has had billions of years and that just builds up irreducible complexity sometimes. So I think it's still hopeful. I think there is real potential and value. But I think also once you can spend some time in you get some humility, the scope of the problem is much grander than you. At least I first realized when I was coming into the space. But yeah, I think it's just a hype train got ahead of the actual technology and then it's like the Gartner hype cycle. I think now we'll end that trough of disappointment and then that slope of enlightenment coming up a few years from now. Interesting. People seem fairly optimistic for a trough of disappointments. It is an interesting perspective. Yeah maybe we're still coming down, I hope not. One problem that I've always found in health applications is missing data. Like, are there data sets like ImageNet for these kinds of applications? So honest answer, not really. 
So I kind of started a project called MoleculeNet a number of years back in grad school, along with kind of one of my coauthors. And our intent was to gather as many datasets as we could to try to make something like ImageNet. And I think the honest answer is we helped a little bit. I think there is a useful collection of data and benchmarks we put together. But the challenge is that molecules are non... So I think in computer vision, I think object detection, object localization don't cover all vision tasks. I think their is some hard frontier of problems still. But you get like a pretty big chunk of them. In molecules, it's more like there's just an entire range of things people want to do with them. You have a little bit of data for each task and the tasks are often not latent. So if you take like a quantum mechanical dataset, you'll find that very different featurization and algorithms actually work better than if you take a biophysical task or a biological task. So I think there is a reasonable amount of data in aggregate. But it's for different applications and you can't easily blend it into one ImageNet style mono data set yet. Interesting. It kind of reminds you of natural language processing with all of its different applications. I think there is a dream that maybe we can figure out some type of universal pretraining that akin to the GPT2 models or to like actually does get you to that universal molecular model. I think as of now, we haven't achieved it, but maybe it's so crazy to think that we can. Like, we do know that Schrodinger's equation at some deep level is pretty close to a, leaving aside relativity, it's the best known model of these molecules we have. So maybe if the quantum computers will eventually help solve this. But it's a ways off for now. Interesting. And the experiments presumably are kind of expensive to run now. Yeah I think there's the rise of mail-order services, things like Enamine or Muchi where you can pick out a molecule out of a catalog, then they'll make it for you and they'll ship it to you. So it's a little easier than it used to be. You don't actually need to be a bench chemist at the same time you do still need to run an experiment. So oftentimes people will say use Enamine to buy it and they'll use a second contract research organization to run the experiment and they'll just keep track of quality control. So it is possible to do it, you know, not quite in your basement, I think. But maybe in a well stocked garage where you can carefully coordinate many e-mail threads or something like that. But, yeah, it's expensive. It'll put you somewhere between a few hundred to a few thousand dollars per compound depending. We have a whole bunch of customers that are startups doing this type of thing, how do they hope to kind of compete with bigger companies when they don't have access to these datasets? That is a great question in many ways. Maybe I'm not the right person to ask because I didn't found one of these startups.I think there is some advantage to coming at it with some new eyes. I think when you're a very big company and are trying to introduce just a shift and thinking. There is of course, a lot of cultural inertia. Traditional startup versus Bigco dynamics. I think there is some potential to pick up kind of interesting potentially looking fruit that just people haven't looked at. I think there is also some eventually, I think, potential for mergers and acquisitions. I think building a talented machine learning team can be difficult. 
And I think if you have a company that has succeeded and has shown some promise, maybe it's a good acquisition target. So I think there are fruitful paths forward for many of these companies. I think some of them are actually aiming really high. They want to be the next Genentech. And I think it is possible, but I think that might end up coming down more to your biologists than it does to your machine learning people. And perhaps I'm a bit of a pessimist on that front. I think core biology, the really foundational stuff, is still beyond our current machine learning and AI techniques. I think it's beginning to change as you get more genomics data, more kind of biological material that you can feed into machine learning models, there's a lot of companies at that frontier. But for now, I think it really is that if you have a crack team of scientists, that might take you further than a crack team or machine learning engineers. Ideally have both, and then you have the best of all worlds, Though it just seems like the data collection process is so hard. It seems you might need to innovate there, too. I mean, I'm coming from my own background of data leveling. It seems so daunting, the idea that you have to order molecules somehow and run a wet lab. I guess again, I have a whole bunch of different questions. One thought I have I guess is probably like the dumb things that people think of when they first hear about this stuff. But it seems like if you could model things about molecules, that's so powerful. That's like the stuff everything's made out of. Like, there must be applications besides biology that might be simpler. Is that is that true? I think absolutely. Now, unfortunately, the challenging part of some of the most interesting applications are in places like batteries. So I think there are kind of other fields. Like, for example, the crop protection industry. So if you make pesticides, herbicides, fungicides, pretty similar techniques, Really?. I guess they deal with the properties of molecules. In fact, this is kind of coming back to that thin line between poison and medicine. If you actually take a look at some pesticides and you look at them, it kind of looks like the same small molecules you have in medicine, which might explain a few things about the world. I think there's also other applications, in industrial applications, probably in petrochemicals even. I think there is a bit. So there is absolutely kind of other cases. But, I think we in the software industry are sometimes used to working in our world of bits. Whereas I think when you get into these industries, you're like, at the end, you have to make something and I think there is that slowdown. I think maybe batteries is actually the hardest. Pharma's a little behind that. I think some of these agricultural applications are a little easier to get to market, but still quite daunting. I think in general, it just kind of comes down to like for a lot of these things, it's actually really easy to make something poisonous. And as governments, as the industry has grown recognition to this fact, you just have this recurring thing that all of a sudden, you invent a miracle, something or other, oh plastics. Plastics were thought to be the wave of the future in the 1950s. They're also a type of just a molecular product. And now we find out that they choke Seagulls they choke baby turtles. There is microplastics everywhere. 
I think this is a type of generalized toxicity issue that we realize when you make large quantities of a new substance that the world broadly isn't prepared to digest: what happens is 30 years down the line, you're like, oh, crap, I killed off the trout. I killed off the eagles. So it all comes down to the fact that I think, you know, living systems are extraordinarily complicated, and making something that is tested and safe for a living thing to interact with is actually very challenging. What about other medical applications? I think you wrote a book on this. Right. So, like, what are the other categories of things? And I guess I'd be curious for your take on how promising they are — it sounds like it's hard to separate out the hype, and you've probably thought deeply about this. I definitely think there is a whole host of really promising applications. To name two: I think microscopy is going to be completely changed by ConvNets. This is one of those magical places where ImageNet works — you can actually take an ImageNet model and stick it on top of a microscope and start doing pretty sensible things pretty quickly. What's an example of a thing that you might do with microscopy? One of the kind of interesting things about this field is that you can pick up a lot more out of a microscope than you might have thought. So there are some really interesting papers that show that oftentimes — say there's some readout of a cell where traditionally you had to kind of destroy the cell, blow it up, in order to get at it. But people have started to show that you can instead get a dataset where you take the original cell, then you blow it up, get the readout, and then you can train the machine learning model to start to impute that from the raw cell, so you can potentially get non-destructive readouts that enable new things. This is kind of more basic science — like, it's not clear what the downstream effect is. There are a number of companies — I think Recursion Therapeutics is a prominent one — that have been using microscopy and machine learning broadly to do phenotypic screens. Earlier, I mentioned you often pick a protein target. Which I need you to slow down for, given my 9th grade biology. A phenotypic screen is what? My apologies. No no, I know phenotype — it's like the expression of a gene. Is that right? Yes, exactly. So I think one way to think about it is maybe bottom-up design versus top-down design. So the targeted drug discovery is maybe bottom-up. You say the human body is complicated, I'm going to be a reductionist, I think this is one magic lever and I can switch that lever on and off, and I can really change everything. And that's kind of you coming from the bottom, and then you hope it makes it all the way to the top. The other one, which is actually the more traditional way of finding medicine, is like, you know, some really smart doctor — this is like the penicillin story — notices some effect, and you have no idea what the effect is caused by. You don't really understand the intricate biophysics, the chemistry behind it. But you see it; maybe there's something that you just observe. I think the famous case of penicillin — wasn't it the mold on the bread? But for a phenotypic screen like the ones Recursion do, basically they have these cell-based assays where they grow cells in a petri dish. And essentially they test: you put a little bit of medicine in there and then you see how the cell's state changes, and you use the microscope and the deep learning system on that to pick up those changes.
So you can do this very rapidly. What would be an example change? Is that the cells are a different shape? That's a really good question. I think it often depends on the disease in question. So like a common thing, say for like cancer is that, the silly one is can you kill the tumor cells? The hard part there is can you kill it without finding bleach? So that's something that's a medicine. I think, for other readouts really depends on the disease. I think the general point there is like diseases are complicated. So there are many proxys people use. So kind of the hierarchy of proxy's is if you have a pure test-tube, which is molecules, that's like the weakest, if you have cells, that's a little better, if you have a rat, that's a little better; but I think the gold standard of course, is like the human. So you can think of this as like it's better than the pure test-tube, but it's absolutely not the same as a human, it is a useful kind of proxy. So, okay. So what the method with the machine learning does is kind of find properties based on the images from the microscope? The way I like to think about it is that machine learning is kind of like making a better microscope. So in many ways, if you go back to classical signal processing. We have all these, you know, Fourier transforms, you have high pass filters, low pass filters. And these, you know, traditional signal processing techniques made things like microscopy even feasible in the first place. Well, you have purely kind of optical microscopes back in the day. But in the last century, I think there's been a lot of signal processing attached to it. So I think of deep learning in these applications as signal processing, turned up to eleven. And so you can pull things out of the image for which there is no obvious way to write down that function. So I think right now it's more like this really fascinating scientific thing, you know there's got to be something there. But I want to make sure I'm like, picture it, like I want to have a mental model. So, like, maybe that was evocative of like, did I kill that tumor cell? So is the point that like the machine learning could tell me if the tumor cells were killed without me having to actually look at it? or is it that the machine learning, like sees something deeper that like I couldn't figure out if I looked at it. So I'll have to apologize up front because I'm not an expert at cellular biology but I'll try to. So, for example, I might be making this up, so if there are real biologists that eventually listen to this, please bear with me. No, it's a machine-learning audience, you can pontificate. By the way I think machine learning people will be really familiar with the idea of just looking at results and not worry about the process behind it. So I feel like this is very appealing to our machine-learning audience. You know, I do have to say I still have no way idea about what happens deep in layer 37 of my ConvNet. Imagine you have a muscle cell and you can often measure like the stretchiness of the muscle cell. There is often ways to kind of guess that a proxy for healthiness. I think the actual thing you measure depends a lot on the biology of the system. For example, like one common thing is that there is these things called fluorescent reporters and you can engineer the cells so that if you have the drug and it actually hits something in the cell that you know about, it sets off light. Here, It's you have to know a little bit about what's happening inside the cell. 
You have to have a guess already. I think the cruder version might be, you know, you have this muscle cell you're looking at. You know, maybe there's some measure of how stretchY it is. Oftentimes it's just like kind of obvious to the eye. It's like that traditional, you know a dog when you see it. You see the healthy cells, they have some, like, nice geometric shape, it looks good. And you see, like disease and they're all like shriveled up and just looks bad. And you can't quite write down that function. You can't know when you look at it. Yeah. So it makes sense to begin to pick this up. Right. And I guess I've seen versions of like cancer cells and kind of different levels. What do they call them, like biopsies? Where you look at the cells. Its 9th grade biology. I guess I can picture what you're saying like that there's like healthy cells. My question is what is the machine learning helping with? Is it sort of like reducing the cost of looking at this stuff, or is it like pulling out other signals that are somehow like, useful? I think it's a bit of both. So I think traditionally, the traditional labor was you'd have a grad student whose painful job it is. If you're unfortunate to be stuck in this lab to is look at cell 1,2,3 ...10,000.one to three, ten thousand. Now, I think there a number of readouts where you just look and you kind of know there is a difference. So I think you can train yourself to read these things. I think this is, again like an interesting example you brought up where you're training the model to basically pick out something and you do it at a bigger scale to maybe before I can only test 10000 views. You know, the grad student union would revolt at that point. But now, you know, maybe I can test a billion or I'm limited more by my supplies. I think the second question you asked is actually the more exciting one. Is it possible we can pick out something we didn't know? So I think there are glimmers that this is yes, I know there are a few papers that are doing things like you can identify where the organelles are, you can begin to do some more complex readouts. But I think there is sort of almost a chicken and egg problem here, as in like when you're discovering something it's like unsupervised learning, right? If you know the thing you're looking for, then you can, like, slot it into buckets pretty easily. But then if it's like you want to go deeper and find something you don't know. I think yes, I think there are likely places that ConvNets act as amplified microscopes and like pick up biology that we don't know. But if I knew that, I would have gone off and written nature paper about it already. I'm sure there is a couple that have already come out of this thing. Okay. So I have to ask you, one of the Nature papers that blew my mind and I think a lot of people was the dermatologist's one where they fine-tuned an ImageNet classifier on cancers. That was not like under a microscope, that was just literally just like photos. And that seemed so amazing. I mean, should I be as enamored with that as I felt or are there some gotchas where it's not actually like it? Should we actually using doctors for these diagnoses still? It sort of seemed like from the paper that it was more accurate than the doctor's diagnosis, wasn't it? You know, I think that entire field for sure, I think is like radiology or I think usually it's like pathology or like dermatology. 
You look at some picture and then you kind of diagnose it, I think that absolutely is a place ConvNets will just make a big difference. And I do think that these models do kind of achieve a striking advance over what you could do previously. So my understanding is that the challenge there is that sometimes these models pick up things that are kind of silly. I remember there is this really excellent blog post where we kind of discussed failed models that are turned out. There are like scans from different trauma centers and the models doing an amazing job, 99% accuracy. Any time you see that 99 percent accuracy know something is up. It turned out there's like some label at the bottom or something that printed to the trauma center so there is like light trauma, Heavy trauma. Guess what that model learned to do right there. So I think it kind of comes down to, what is the model learning? Is it a fluke? Is it kind of an actual thing? Radiologists were kind of tried and tested like, do you really want to fir your world-class radiologist? So I think there's there is a natural caution there. I think in part because we don't really understand what happens deep in layer 37 of the resnet. So I think the FDA and some companies are moving forward. I do think in potential in places where there aren't enough doctors, this could be kind of potentially a revolutionary advance or you could get, you know, world-class scanning centers, available clinics throughout the world, and not just places where you have excellent hospitals already. But I think it will take some time. I remember a number of years ago, I think maybe in the 80s, again, there's a whole wave of hype around expert systems for medicine and how they could diagnose patients. And I think it might have been in that same blog, a retrospective study that found that many cases, hospitals that deployed expert systems, actually had a fall in patient kind of well-being afterwards because there are these complex interactions that no one thought of in the first study. And then you find a number of years later that there is this unexpected side effect. So, yeah, I am, long with an answer there. I think it is something to be interested in and excited about. I think it will also take time to really bet and really kind of like make sure that this is something that improves patient well-being. Although I guess I do know like what happened with the melanoma model, because it does seem like, you know, doctors are also not perfect. And you know, I also cannot inspect my doctor's brain to really know their decision-Making process. So I wonder, is it unsafe to not change, or was there some real flaw or some simplification that it wasn't obvious. I don't think there is a flaw in the paper. My guess is that.. this isn't my field, soof projecting a little bit out there. I know that the entire deploying something in the clinic, in the health care side is actually quite more complicated even than the new biotech side. I think you have to work with insurers that work with payers to work with hospitals and doctors. You know, the American healthcare system has many known challenges. My sense is that this has just been very hard to actually get out there. So, I think, in Pharma Inforum and biotech, I think the advantages is like if you get something to work, there is actually a very well known path to get it to people. I think for advances like this dermatology thing, there's actually a fuzzier, more ill-defined path to get it out there in the wild. 
I think there are some real scientific questions around whether this is actually robust that still need an answer. But I think there are also harder business questions about whether this makes sense as a viable business. And I'm sure there are like a dozen startups who are working on this right now, but I just don't know as much about it. Actually, my wife runs a healthcare startup, and she tells me that it's the only industry where you can literally save money and save lives simultaneously and not have a viable business. I've had a few friends who left health care and have formed ostensibly boring but very successful startups and are much happier with their lives. So I sympathize just a little bit. But, you know, you probably know way more about this than I do. Like, it's a little bit outside of my expertise. Sorry to take you out of your expertise. But this is what I was hoping for with the podcast — that I could corner guys like you and ask all of my dumb questions. I really appreciate it. And I think we should kind of wrap up because I think this might be getting long for the format. But we always ask these two questions that I'm kind of curious about, actually. I always say this, but really, I am curious how you're going to answer this. What is one really underrated aspect of machine learning that you think people should pay more attention to? What comes to mind? That's a really good question. I think that machine learning as amplified signal processing is a view that is not as commonly celebrated. But I think there are these really exciting things going on. Machine learning is finding its way into instruments — into sequencers, into microscopes. It's a type of internet of things, but not the consumer version. I think traditionally new scientific instruments are the predecessor to fundamental new scientific discovery. So I think that when we find deep learning making our instruments better and more capable, then we're actually setting ourselves up to discover and build fundamental science. So that's something I'm very excited by. But it's kind of a longer arc... We might have the instrument, and we still need the Einstein or someone to come in and work with it and really get us that magical new understanding about the world. But I'm excited by that. That is a totally cool answer. But I guess they may give so many readings that it's hard to even interpret — but I guess a good algorithm would give you a few high-value, what you'd call processed outputs, like that. I think for now it's still going to be quite a while before we see that. We talk a lot about AGI, and I know there are many ways in which you could get a general intelligence. But the process of induction, of interpolating things about reality from very few hunches — this is probably made up, the Newton apple tree story; it probably didn't happen that way, we know it's a just-so story. But you could imagine some machine learning model seeing that. Can you somehow interpolate from that out to the universal law of gravitation? That I think would be amazing. It just seems far beyond our current science. I feel like with all these medical applications, I guess the reason I naively find them exciting is that if you're trying to compete with the human for navigation and driving, our brains are designed for that. Clearly, a huge part of our brain is just there to navigate the world and not crash into stuff.
But it doesn't seem like our brains are designed for interpreting molecules that we can't see and what effects they might have. I mean, I'm still trying to visualize it in my head; I can't even do it. So it sort of seems like maybe the bar is lower for a useful algorithm. I think it's a really interesting point there. I do think understanding quantum mechanics, at least, doesn't fit in my head. There are lots of complicated things going on in that hidden world. Maybe part of the challenge is that it's hard to validate a discovery. Many times a model says something, but after you spend a while on it, like 9 times out of 10, you're like, what bullshit did the system pick up this time? And I think the challenge there is, like you said, maybe we have to make the models robust enough that there are actually high-quality signals coming out. So we're like, oh, that's a clue — or, oh, I don't know what hiccup happened in, you know, step two thousand of gradient descent. So I think that's maybe the challenge — we just haven't gotten there. I think this is beginning to change. It feels like discovery, like invention, is still the province of the human and not the machine. But, you know, maybe that's the antiquated line, and 10 years from now AI will have discovered everything and I'll be like, well, that aged poorly. It will be an interesting world if that comes to pass. All right. So the final question is: right now in 2020 — I guess it's already June — what do you think is currently the biggest challenge of making machine learning models work in the real world? Like, in your experience, what are the challenges that you've run into? What have been the surprising hurdles? I think the things more specific to me are often about small data. Like, again, you have 30 data points, and oftentimes it's a very well-meaning scientist who kind of comes and says, what can you do for us with 30 data points? And oftentimes I'm like, oooh, I wish I had a better answer. Sometimes you just try seven things — you try transfer learning, and you try multitask learning, meta learning — and all of them fail. And then at the end, the random forest is like, yeah, it's not great, but it does something. So for things I'm excited by: robust transfer learning that actually works on small data, which I think has occurred in NLP but has not occurred in molecules — I think that would be an amazing advance for this field. It's so interesting that it hasn't occurred, because it's totally happened in vision for sure, and in NLP now, definitely. It's interesting that it doesn't work for molecules. It might just be data. I think if someone just found a gigantic trove of high-quality molecular measurements, you could build on that. But nobody is just going to find that, right — it has to be collected. I think this is one thing where a governmental effort could do amazing work. You know, to be fair, governmental agencies have actually put out most of the open-source data out there, so they are actually working hard at this. But, yeah, it's maybe the sort of thing where, if you got a $10,000,000 grant or something, I think you could make a serious dent at putting together a high-quality open dataset for this — but it is more expensive than ImageNet, and it will take more resources, because you'd have to do the actual experiments. Great answer, I love it.
Well, thank you so much. Is there like someplace we should tell people to contact you or is there a thing you want to promote, maybe DeepChem. Everyone should try it. Absolutely. I think part of like the goal behind DeepChem is to make opensource more feasible for drug discovery. So I think we could definitely use more users. In particular, if you an engineer that knows how to handle build processes well, please get in touch, you know, I am trying to figure out the windows and etc. builds and it is such a pain. I am too much of a scientist. We could absolutely use more help. So if you are interested in open science, please do get involved. I love it. Thanks, Bharath. My pleasure. Thank you for inviting me.",9287 +Chip Huyen — ML Research and Production Pipelines,https://www.youtube.com/watch?v=6adNHwE5PHY,2587,2020-07-28,"I interned at a start up where they used Tensorflow it was an internship and I was blown away. Like, wow, I didn't realize you could do so many things with it. So I went to a couple of my professors and was like, can you teach a course on this? And my professor was like, I don't have time. Why don't you teach it? So I was OK. You're listening to Gradient Dissent, a show where we learn about making models work in the real world, I'm your host, Lukas Biewald. Chip Huyen is a machine learning expert currently working at a startup that focuses on machine learning production pipelines. Prior to that she worked at NVIDIA, Netflix, and Primer. She also taught the course at Stanford on tensor flow for deep learning research. And maybe most fascinatingly, before she became a machine learning expert, she was he was a best selling author in Vietnam, I'm super excited to talk to her. My first question for you when I was thinking about interviewing you was actually.. I really want to hear the whole story about how you got into machine learning. I kind of have bits and pieces of your background that you've told in the past, but tell me your life story. How much time do we have? We can cut it down [laughs] How did you get into tech in the first place? It’s a funny story because I come from a very non-tech background; as far away as you can think of. So after high school, I didn't go to college and I started traveling. So I did that for three years and in the process, I was writing; I was writing for a newspaper, I hosted a couple of columns and I wrote a couple of books which got me into more trouble since and I wanted to homes than I wished for. Wait, what?! You know, internet popularity is a double-edged sword. So my books got to be popular and it was very popular, that sounds a bit self arrogant. They were best sellers, right? I think it's fair to say they were popular. In Vietnam, because while traveling, I met people on the road and I was young, I didn't know to handle all the attention. And people were like all these like “it’s not possible for a girl to travel by herself” and most young people were like, “she didn’t write the book. She didn't write any of that.” Like “she must’ve had people writing things for her, doing things for her, accusing me of having a lot of money, lot of travel and all. So there was a lot of controversies and I was a little bit offended. I was like, who are these people? Why are they like making me answer all these stupid questions? But at that time, I did not know how to handle that and it caused a lot of like a backlash. So I was like, okay, I'm so tired of this. I'm going back to school. 
So I went back to Stanford and I was thinking of doing something like writing or political science. And I was at Stanford and everyone told me that the question is not whether you should take a CS course but when you do it because 90 percent of undergrads take the CS course at some point. So I just took a course in the first quarter and I really liked it. When you came to Stanford, how old were you? You were already a bestselling Author then.. You don't ask that question [laughs]. So I was older than my classmates when I took my first quarter and it was fun and I kept on doing more courses. Before I knew it, I was a C.S major and I took an A.I course; I cried a lot in the first class because it was so difficult. [laughs] I think when I came into Stanford it was a peak of the A.I. Hype. Can I say AI hype? You can say literally whatever you want. [laughs] So I did that and it was fun and yeah here I am and I think in my third year I taught a course. Your third year as an undergrad? Yeah. Wow! I’m trying to think about what I was doing as an undergrad, I feel embarrassed. To be fair, I was like older than most people, right? And I actually didn't have to spend time on frat parties, or trying to like impress people. I was pretty much done with the party scene by then. Right, so you taught a really popular class, right? I think you could say it was a popular class, I could say. I think you can say that. It was quite unexpected. I didn’t even know that the class was popular. I was just teaching it. And one day walking to the dining hall, a friend was like, did you see that comment about you on Hacker News? And I was like, why would anyone say anything about me on Hacker News? And it turned out that my course had been picked up by hackernews and I was like wow, that's interesting. And at some point, you know what happened? I was not really active on twitter back then and one day I opened twitter and I saw I had ten thousand followers and I went, “Wow. Who are these people?” It was great. Well, it was timely class. What was the topic of the class? Just for people who might not know. Oh, it was TensorFlow. I think it was the right time. TensorFlow was very popular in 2016. Wow. It's crazy how fast things change. Like back then in 2016, Tensorflow was all people could talk about and now it’s what people complain about. So, yeah. So I taught a course on TensorFlow and the official name was TensorFlow for Deep Learning Research, which is a not-so-flashy name and I think I also put a lot of materials online. And I think it was maybe the first college level course on TensorFlow. I’m trying to remember it in 2016, I think you must’ve had to compile it yourself right, to use the GPU back then? I'm trying to remember… I remember just installing TensorFlow as a pretty painful experience for me. I don't remember it to be so painful. It was just some concepts that were a little bit hard to grasp. As in like a computation graph; so there should be a graph first before you can run it. So I think Tensorflow 2.0 now is a bit different. Got it. How did you come up with the material for that class? How did you even think of that? So when I started teaching the course, I was just hoping to learn personally, you know? I started taking courses as a sophomore, sometime in my second year. So I didn't know a lot. I interned at a start up where they used Tensorflow it was an internship and I was blown away. Like, wow, I didn't realize you could do so many things with it. 
So I went to a couple of my professors and was like, can you teach a course on this? And my professor was like, I don't have time. Why don't you teach it? So I was OK. And I got a lot of people to help me. I had some friends at Google who knew a lot about tensor flow, I had professors take a look at my curriculum and got a lot of feedback. I read lecture notes. I was really nervous so I had really good friends who were coerced into being my fake students. So for every lecture, I would make them sit and listen to me. Give them a fake lecture. So I think I got a lot of help. It was like learning together with my students. I didn't think of it as teaching as much as a group study. That's super cool. When you came to Stanford, you hadn't taken any computer science class before? So I came from a math background, so I did math in high school. So I think I took some C.S courses but, you know, it was more like very, very basic. And if I remember, it was a blue screen back then. Wow. Yeah. This is simply amazing, you went from introductory computer science type stuff to teaching a TensorFlow class two years later. It’s amazing. Do you have any advice to other people who want to learn this stuff? I think this is the beauty of computer science; the barrier’s to entry is really low. Also with ML. Especially with the experiment-oriented progress which translates to mean that you actually don't need to know a lot of theories to make a contribution. So I've seen people who get into ML for like a year and are able to make pretty great projects. Which like I am still a little ambivalent about it. It’s good as it lowers the entry barrier and allows more people from different backgrounds get into it. So like what does it say about the field, when somebody joins for a year and is able to make a pretty mind-blowing experiment. So I don't know how to feel like... Maybe it means that there's lots of interesting stuff to try [laughs] I'm so skeptical of giving advice. I think I’ll just say get your hands dirty. Then try things, try things out and be friends with smart people. Be friends with smart people [laughs] [laughs] I have friends smarter than you, I think. I don't think I could have got anything done without my friends. Really. That’s so cool. I think that's good advice. Why did you choose to go into AI? For me, it was just the promise that AI held. So I come from a village in Vietnam and I travelled and there was a time I realized that it could be great to actually overcome language barriers. For the simple majority of human knowledge, it’s written in English, and people who don't speak English can't access it. Like people in my hometown can't really read anything that I write in English or my parents are afraid of visiting me in the U.S. because they wouldn't be able to navigate the airport or how to get here. So at this time, I was really interested in machine translation and thinking if you can automate a translation process, then it could be really, really helpful and if we can overcome the language barriers and help people, then maybe people from my village can access human knowledge or just step out of the border. That's so cool. That's what I thought back then and very idealistic. What are the topics that are most interesting to you right now in machine learning? I think over time, what we are liking is better engineering in machine learning. So there are two aspects, both as in engineering and in research and during production. 
So in research there are a lot of researchers who are amazing at what they do, but who are also not good with engineers and it's not because of them, it's just because as humans, our time is limited. We focus too much on research. We can't expect them to be great engineers. So I wonder if there is a better.. if we can build a good tool or toolset to help researchers carry out their research more efficiently. Also if you have clean code, it’s easier to experiment and this helps with reproducibility and in production, I also think that there’s a gap with people; researchers and production engineers. I think there's been a lot of progress in researching machine learning and now the question is how do we bring the research into production and that’s what I'm very interested in and also the start-up I'm part of right now is also focusing on that by helping companies productionize machine research. You've worked with some big companies, generally what kinds of problems do you see when companies try to take on research and production? What are the main ways you see companies fail at this? One of the big things is a lot of companies are chasing buzzwords. They are like how can we use BERT, How can we use transformers? And you can look at them and say you actually don't need that. You don't even need deep learning. Like, a lot of your problems can be solved by traditional classical algorithms. So sometimes companies say things like this one should use very fancy techniques. The reason can be because they don't understand what is happening, because I think there is a lot of misunderstanding in AI reporting that you will see, like a lot of journalists and reporters talking about AI and if they don't have background in AI they can simplify or just present it as an accessory they give. Like what? What exactly is going on? And it's like in some companies may just want to attract clients, like you said, they say we are using state-of-the-art techniques. So some companies actually go out of their way to try to use that. So that’s one problem. I think the second is a lack of data, and I think you guys know that very well because you also are trying to solve that problem, right? So in research, people work with very clean static datasets and in any business, you'll want a clean and standard datasets because you want to focus on separate models. I think more and more as models are being commoditized, you can take off-the-shelf models. So now the bottleneck is data and real world data is nowhere close to research data. So the problem is how do you collect data and verify data, how do you cope with constant distribution shifting/data drift? So there's a problem with data. Another problem is with interpretability. So in research, sometimes you can have more of the state-of-the-art but also in the real world, you just don't care about accuracy or F1 or whatever metric do you were seeing]. How can we explain the decisions that the model is making? I think a lot of people are focusing a lot of time on this so I think we are making progress. It's funny, I saw a tweet recently. I think it was someone had Opeen AI who was arguing that folks should not teach anything besides neural nets. Oh my God, it’s such click bait. And it's funny because it reminded me of when I got my first job out of school. There was sort of a similar debate but it was different topics, it was basically like machine learning versus rule-based systems. 
And there were a lot of older researchers who had kind of built their careers on logic and rule-based systems. And they would say, oh, obviously you should do both, and I was like, “Come on! Rule-based systems don't really work on anything. Can you find me a benchmark where it actually makes sense to use a rule-based system?” That's how I felt at the time, and I think I might still… I mean, you don't see a lot of rule-based systems in production in the last decade or two. I don't come across them, I guess. And then I was thinking, you know, when that person made a click-bait tweet I went, no, no, it's ridiculous. And now I'm thinking, like, am I now the old guy who is just justifying the things that I know… No.. No.. Did you get baited? Did you participate in the discussion? No. I'm always afraid of controversial topics on Twitter. [laughs] It is not even controversial, it's just wrong. Like, I don't get it. I mean, if it weren't from someone at a well-known company, I would think that person was trolling. Maybe he's trolling. I'm not sure. Unlike my work, which doesn't feel like it's in neural nets at all, you actually work in neural nets. What are the non-deep-learning algorithms that you think are useful, that you would keep around? And when would you use a different one? I mean, XGBoost is still like the most popular algorithm in Kaggle competitions. K-nearest neighbors is still really good for anomaly detection. There are a lot of really great algorithms. A lot of people don't even know what boosting is, which is a bit sad. The other day my friend was telling me about how he was interviewing somebody, and the person could explain perfectly well what a transformer model is but couldn't explain what a decision tree is. And it was… I don't know, maybe I'm old, too. I don't even know anymore. What are the situations where you would recommend using a boosted tree versus a neural network approach? I think definitely for baselines. For example, if a simple model does the job reasonably well, there's just always value in trying it. And in production, the simpler a model is, the easier it is to understand, to implement, and to avoid mistakes. So if you don't get improvement from more complicated methods, don't go there. It's also hard to tell, because a lot of improvement is incremental, right? So you can say this only gives a 1% improvement, it's not worth it. But for that 1% improvement, you invest more time and get another 1%, then another and another, and over time you can get up to a 10% improvement. But if you stifle it from the beginning, then you would never be able to reach the point where it should be. I'm not saying don't use machine learning or deep learning. I'm very pro deep learning. I'm just saying we should not forget simple baselines, and I don't think we spend enough time talking about or defending baselines. Interesting. A lot of people have said that and I happen to agree with them, but if someone were to ask you why baselines are important, how would you answer that? A metric, by itself, doesn't mean anything. You saying 90% accuracy doesn't mean anything. So we say, “Oh, my model is amazing. It has this accuracy.” What does that even mean? For example, somebody showed me this model and said it has 90% accuracy, and I say, wow, if you predict at random it's like 89% accuracy already, so what is the point of this? So I think baselines and benchmarks should help you localize where the model performance is and where you want it to get to. 
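A small illustration of this point about baselines: on an imbalanced problem, a majority-class baseline already scores close to 90%, so a model reporting 90% accuracy may be adding almost nothing. This sketch uses scikit-learn's DummyClassifier on synthetic data; the features are deliberately pure noise, so the learned model is not expected to beat the baseline, and the comparison itself is the point.

```python
# Illustrating why a reported accuracy means little without a baseline:
# on a heavily imbalanced dataset, always predicting the majority class
# already scores roughly 89%, so a "90% accurate" model adds almost nothing.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))                      # noise features (toy data)
y = (rng.random(n) < 0.11).astype(int)           # ~11% positives, ~89% negatives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("majority-class baseline:", accuracy_score(y_te, baseline.predict(X_te)))
print("model accuracy:         ", accuracy_score(y_te, model.predict(X_te)))
```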
Also, interestingly, looking at human baselines, for example. If humans could just understand how well they could do... So say the model is at 90% and a really amazing human baseline is like 85%, we say, oh, that's superhuman performance. But if the human baseline is 99%, we know that we still have a long road to go. So the human baseline is maybe in some sense like a best-case scenario. Yeah. I think in a lot of cases human baselines are the best case, but not always. Right. Another thing I'm curious about is your work on this, because I think fake news is probably going to be a big topic again with the election coming up. I think this was a fun class project. So it was after the election and we were curious to see… I think echo chambers can also help echo fake news. I feel like the same fake news usually circulates within the same echo chamber. And, you know, if one echo chamber shares a certain piece of news and another echo chamber shares similar news but with a different perspective, I'm not sure how healthy this might be, but it might be interesting to cross-share, to bring a similar piece of news from one perspective to another echo chamber. Not fake news, of course; there's no point in spreading fake news from one echo chamber to another. So what we did was we got a lot of tweets and their hashtags. At that time, as I said, it was after the election, so we collected a lot of tweets from during the election in U.S. swing states, and we had some seed hashtags that we knew came from tweets that were pro-Republican or pro-Democratic. From those seed hashtags, we made the assumption that if two hashtags appear in the same tweet, then they likely have the same sentiment. If one hashtag appears next to a hashtag that's pro-Republican, then it's likely to also be pro-Republican. From that, we had an algorithm just to resolve conflicts, very simple, like majority voting, and from that we were able to label about a thousand hashtags. And so from those hashtags… So just to be clear, you labelled those hashtags as liberal or conservative, essentially? Yes. Based on what they co-occurred with? Yes. So after that, we looked into tweets. We built a graph of relationships between users; for example, if user A replied to or retweeted another user, we linked them. So we built graphs of users, and we looked at the hashtags they used. We tried to predict whether a person is liberal or conservative, and then we used some graph algorithms to detect communities, and then we looked at those communities and checked whether a community had many more conservative than liberal members. It was really fascinating, because what we found was that about 50% of the communities we found are neutral, meaning the difference between the number of conservatives and liberals is not that high. But about 25% of them are conservative communities, where the number of conservative members is more than three times higher than Democrats, and the other 25 percent are Democrat communities. So you found the echo chambers? I'm not sure I would say echo chambers, but I do feel like people who share similar beliefs definitely have stronger ties with each other than with people with different beliefs. I see. So then were you suggesting to sort of spread information between these communities or...? We never got to test that, because it would require access to such a strong social network, but we had ideas. 
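A rough sketch of the two steps described in this project, with made-up tweets and hashtags: propagating seed hashtag labels by co-occurrence with simple majority voting, then building a user interaction graph and detecting communities with networkx. This illustrates the general approach, not the class project's actual code.

```python
# A sketch of the two steps described above, with made-up tweets and hashtags:
# (1) propagate seed hashtag labels via co-occurrence with majority voting,
# (2) build a user interaction graph and detect communities.
from collections import Counter, defaultdict
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

tweets = [
    {"user": "a", "reply_to": "b", "hashtags": ["maga", "buildthewall"]},
    {"user": "b", "reply_to": "c", "hashtags": ["maga"]},
    {"user": "d", "reply_to": "e", "hashtags": ["imwithher", "voteblue"]},
    {"user": "e", "reply_to": "d", "hashtags": ["voteblue"]},
]
seeds = {"maga": "R", "imwithher": "D"}   # hand-labeled seed hashtags

# 1) Each unlabeled hashtag collects votes from the labeled hashtags it co-occurs with.
votes = defaultdict(Counter)
for t in tweets:
    labels = [seeds[h] for h in t["hashtags"] if h in seeds]
    for h in t["hashtags"]:
        if h not in seeds:
            votes[h].update(labels)
hashtag_label = dict(seeds)
hashtag_label.update({h: c.most_common(1)[0][0] for h, c in votes.items() if c})

# 2) Users are linked when they reply to / retweet each other; then we look at
#    the partisan mix of the hashtags used inside each detected community.
G = nx.Graph()
G.add_edges_from((t["user"], t["reply_to"]) for t in tweets if t["reply_to"])
for community in greedy_modularity_communities(G):
    mix = Counter(
        hashtag_label.get(h)
        for t in tweets if t["user"] in community
        for h in t["hashtags"] if h in hashtag_label
    )
    print(sorted(community), dict(mix))
```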
So we ran to some literature and they say that if you actually show somebody perhaps a news article with an opposite point of view from their beliefs, they're going to ignore it. So if you believe in a show and saw the opposite of it, you’d only be like, “oh, just fake news”, right? So you actually have to slowly show them a similar news article but with a slightly different point of view. You can't give people a totally opposite viewpoint and expect people to listen to it. I see. That makes sense. So we never had a chance to test an algorithm but we thought that detecting echo chambers maybe a first step in finding a way to break them up. I see. Yeah. A lot of people just want to live in their echo chambers [laughs] I think that Silicon Valley is a massive echo chamber, really. I think we all live in the bubble and I feel like somehow this pandemic has made me realize how different our bubble is and how strong it is. The other question I wanted to ask you about is you've recently gone from a bigger company to a start-up, which is a little bit of a shift. Has there been any surprises there? How does that feel to go from, big company to start-up? Was that a big cultural shift or...? It’s a big difference. It’s such a big difference and it's just what I wanted. I thought that after graduation I would like to try a different working environment to see which one I would like. So leaving NVIDIA was not a reflection on NVIDIA, it was just a reflection on myself, because I just wanted a change. And my co-workers at NVIDIA have been really, really, helpful. And, um, they're really great. So it was a big shock to join a start-up. The first I think, is the workload is so much more; which is a good thing.So you know at a big company, you might be leaving work at like 5, 6:00pm but at a start-up, you might have P.R requests at like midnight on Saturday, right? Which is not a bad thing… I don't know.I think I'm just still very ambivalent about one's work-life balance discussions, you know. So some people say, “You shouldn’t work on the weekends” or “No company should expect its employees to work on a Saturday evening.” As much as companies may not expect you, it isn't what they expect of you as much as what we expect out of ourselves. I don't want you to promote the pressure of working too hard but I do believe that when you leave careers, there are certain compromises you might have to make, which depends on what you want out of life. Anyways, there's a very roundabout way of saying that I work on weekends. [Both laugh] It sounds like you're enjoying that experience. I also don't have much of a life, you know, so... I guess right now there's less going on, yeah So yes, it’s a big shock that people work on the weekends. I think I like it. I'm not sure how much longer I would like it for because I also heard that when you have family, like for you guys, I heard you had a baby recently. Congratulations! Yeah. Do you still work on the weekends? I don't think I have a very strong point of view. I feel like it's a little bit weird to tell people not to work super hard. Like, I worked incredibly hard in my 20s and. I kind of like to imagine that hard work pays off. I feel proud of the stuff that I did. I think you did a great job. I mean, I’ve heard you have done a lot of great things. I’m a fan. Oh thank you. In the best situation, working really hard can be incredibly fun like for me. So I’ve realized that I’d rather run a company than, you know, work for someone else, that’s my particular point of view. 
But for me, working really hard, can be like a real joy. Like when I started my second company, one of the really fun things for me was that it actually made sense for me to pull an all-nighter once in a while, allow it to happen. Really? Do you still pull all-nighters? No, now I have a baby so I have a different kind of all-nighter. [Both laugh] I think that the important thing is that the company is trying to do something and figuring out how to do it over the long term is the important thing and it needs to be at a sustainable pace because if you work hard and burnout, that's super counterproductive. But I don't always think that burnout actually comes from working hard. I think burnout comes more from like working hard at things that seem pointless or not seeing the success that you wanted. Yeah. That makes sense. So I think for me I see people working on the weekends doing better and like me I also like working on the weekends because I feel motivated, because I really like what I do and I have faith in it and I also want to contribute. Like if on a Saturday night I could stay at home and watch some I don’t know, like the bachelor. I don't know what people seem to be watching or I could just go on Github and check out some P.Rs. I know that sounds like a horrible analogy, but yeah. I feel motivated. So okay. Anyway, back to the question, the first thing I noticed was the different workload. And I feel okay to work on the weekends because I also get to talk to my co-workers on the weekend and I like it. Then there’s a very simple understanding because I think everyone knows that some people work on the weekends and it gives them more flexibility. Like if you work on the weekday and you feel burned out and tired, you can take a day or two off during the week, it’s fine as well. And so this understanding is in consideration while making the schedule as it will be unnecessary for us to follow a typical five-day workweek and I work on the weekdays and then take the weekend off. I think the second thing is the exposure I have to the entire stack. So working in a big company, I was focused on very specific products and shielded away from aspects like QA or client but at a start-up, I get the chance to see everything and we sort of built it from scratch. So I'm exposed to the decisions that we have to make; like what’s the tool to use or how to structure the repo. For example, do you want a modern repo or do you want a very small repo? So a recent decision is what tools to use because when you join a big company, usually you had to use standard tools and someone has decided for you but at a start-up, you have a say in choosing the tools which expose you to various problems as well. So I really like it. And also, of course, there’s the size at a big company. There are a lot of people and I think it’s nice that at a big company, you have access to a lot of people but at a start-up, you have only a small number of colleagues so you can’t send a message to somebody from another team because all you have are just the people on the team. Cool. Makes sense. So I think we’ve gone a bit overtime but we always end with two questions, really kind of curious to hear your takes on these. So the first question is, what is a topic in machine learning that you think people don't talk enough about or is an underrated topic that people should talk about more, but they don't? So I think you have a list of things I usually carry around. One of them is graph. 
I love graphs, and I used to think they were underrated; I tweeted about it about three years ago. But I think that has changed. Like, I was at NeurIPS in December and I saw so many papers on graphs, and there was a workshop on graphs and it was one of the most attended workshops. You mean graph in the computer science sense, right? Like graph networks, graph theory? Yes. So now there are a lot of graph models, like GNNs or GCNs, graph convolutional networks. So I think there are multiple uses of graphs in deep learning. A graph is a natural representation of many inputs, right? So data from a social network, for example, is a graph; or recommendation systems, where we have users and items, and it's a bipartite graph. So a graph is a natural representation of input, and a lot of distributions can be represented using graphs, like graphical models. So graphs can be both input and output. And graphs also have a lot of relationship with convolutions, because they focus on local connections: a point in a graph is connected to its neighboring points, and a convolution is a very local linear transformation. I don't know if I'm making much headway with this explanation right now, but yeah, I'm in love with graphs and I'm so happy to see that they're catching on. I think another thing that's underrated is the engineering aspect of machine learning. We see a lot of people talking about integrations for deep learning or version control, but I think people, from what I've seen, are beginning to catch up with it. Cool. Good answer. Obviously I agree. The second question is, in your experience... I'm really curious, I guess, about this one... and actually, in your experience... If you don't mind, I'd like to add that I feel like another thing that's underrated is, and I'm very production-oriented, so I think monitoring. If you deploy a system, how do you monitor it? How do you know when you need to retrain the model? How do you know the data distribution has shifted? You know, I haven't seen a lot of monitoring, so I think it's still very underrated. Yeah, totally. Yeah. Sorry. No, no. Trying to sound smart, you know? I think you're successfully sounding smart. But the second question is, in your experience taking projects from training into potentially deployed systems, where is the biggest bottleneck? What's the hardest step in the process? What's really clear right now is when you have very big models, it's really slow to run inference. It's very slow and it can be very costly. So for example, you can try to take GPT-2 into production, and even on spot instances it costs quite a bit for every inference, like every time you make a prediction, you want to generate something. So this is why we haven't seen a lot of GPT-2 in production yet, and it's a very interesting problem. Like, I'm not sure if I can mention the exact company, but some start-up told me it was using GPT-2 in production, and they said if they could reduce the inference time by half, they would be able to break even. So I presume that instead of using normal precision, if they can somehow make it work in float 16, at half precision, then it can cut the inference cost roughly in half and therefore help the company stay afloat. It'll make a really big difference; like, you just break even, or not. Especially in this economy. Yeah, totally. Wow, great answer. Interesting. My final question actually is simple. 
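As a rough illustration of the half-precision idea mentioned here, the sketch below times GPT-2 generation in FP32 versus FP16 using the Hugging Face transformers library. It assumes a CUDA GPU is available; the timing loop is a crude comparison, not a careful benchmark, and the actual savings depend on hardware and batch size.

```python
# A rough sketch of running GPT-2 inference in half precision (FP16) vs. FP32.
# Assumes the Hugging Face transformers library and a CUDA-capable GPU.
import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("Machine learning in production is", return_tensors="pt").to("cuda")

def time_generation(model, n_runs=5):
    """Average seconds per greedy generation over a few runs."""
    model.eval()
    torch.cuda.synchronize()
    start = time.time()
    with torch.no_grad():
        for _ in range(n_runs):
            model.generate(**inputs, max_length=64, do_sample=False)
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs

fp32 = GPT2LMHeadModel.from_pretrained("gpt2").to("cuda")
print("FP32 s/gen:", time_generation(fp32))

fp16 = GPT2LMHeadModel.from_pretrained("gpt2").half().to("cuda")   # cast weights to FP16
print("FP16 s/gen:", time_generation(fp16))
```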
If people want to learn more about your work, do you have a Twitter account or a company account you want to tell us about? Yes. I spend too much time on Twitter and I'm ashamed about that. Follow me on Twitter. What's your Twitter handle? It’s @chipro. That’s c-h-i-p-r-o. as in professional, but r-o means crazy, so chip-crazy. Oh really? I did not know that. That's awesome. Yeah. Also I have a blog where I blog about tech and stuff but I write long form like each of my blog postsa takes me like two to three months to write. So I don't write a lot but you should check it out. That's great. Thank you. Thank you. Coming from you it means a lot.",5888 +Peter Skomoroch — Product Management for AI,https://www.youtube.com/watch?v=hSyb3xEvCrI,5294,2020-07-21,"So Peter, what are you working on? So I'm spending much of my time at home like everybody else, but I'm doing a bunch of writing and some angel investing, and I'm actually doing a bunch of gardening as well. And so I'm taking this time to kind of reboot my garden. And one of the problems that I have that's driving me crazy is raccoons. So I'm in San Francisco and there's a lot of raccoons in our neighborhood that are just tearing everything up. Pete Warden has this great book, Tiny ML. Oh, yeah. And it uses TensorFlow Lite. I'm sort of going to play around with that and actually there are these robots; you can get these on Amazon or from Wal-Mart or something. It's really cool. It's called Mekamon. The company, unfortunately no longer exists, but you can get these kind of cheap. Somebody did some bluetooth sniffing and actually put out a Python API so you can actually control it. It's got servos, It's pretty robust, can move pretty quickly and so I'm going to do a project to make this sentinel in my backyard to protect the garden from raccoons. Hopefully non-lethal. Protects. Protect and serve. Yeah, non-lethal. But I think raccoons are afraid of things like dogs. So if I can get this to bark like a dog, maybe that would work. I feel like the San Francisco raccoons are afraid of nothing. I don't know. I'm curious how this goes for you. But I got to show you, I got... You got the same book! This guy is bigger (show him a robot). Put this together myself, bought some kit off of Alibaba. By the way, there's an open source project called Spot Micro. If I go and buy a 3D printer, I could put it together but it's like an open source spot Mini. I think it might have been open AI or somebody else; they put it in their simulation framework. So there is a model, Mujoco I think it might be. So you can actually train it in a simulated environment but I think it's still very nascent . I haven't seen an actual working video of this thing working so I think people are still trying to build it. It's not working yet. I want to start off with going back to your first job or you doing ML data science. I can say maybe even before LinkedIn. AOL, I think it was, right? Well, yeah. Even before that. So I think when we met I was I was at AOL or just finishing up at AOL Search. And so that's an interesting conversation in and of itself. I was working on the search team there. I had just started. I came there from M.I.T. and the week I started, there was a release of user search data. That was the weirdest first week on the job ever, for sure. Everything was in disarray when I started. 
But going back, so first job, I mean, when I was an undergrad, I worked in physics and neuroscience and so machine learning was, I think, kind of seen as voodoo to those folks back then e.g there was this thing called deconvolution of signals. I worked on a project. It was a summer project; we were working on anti-matter, which sounds cool, right? But positronium is basically a positron and an electron that forms an atom, like hydrogen except with matter and antimatter. Anyway, long story longer, the hard part was you're taking these measurements, real world sensor data, you're trying to detect the decay of this atom. Because when the positron and electron are near each other, eventually they annihilate and it gives off radiation. And so you have all these sensors and equipment set up to measure that annihilation but the problem is, everything that you're using to measure it, pollutes the signal. So what you actually measure is this convolution of the raw data with those signals. And so at the time, one of my projects was to deconvolve that signal. So deconvolution is one of the core machine learning problems from back in the day. It's kind of like the cocktail party problem. You have a bunch of people speaking and you need to disentangle the different voices in a recording. Anyway, that's actually what I think was the first work that I did in neural science, I was really interested in neural networks, basically and interested in process. I found the processing of the data more interesting than the actual experiments, which are, you know, a lot of hard work and a lot of time in the lab. So I really just dug deeply into signal processing and machine learning. And so before I got to AOL and I actually had two other roles where I cut my teeth on big datasets. The first one was actually back in the first dot.com crash. It was a company called Profit Logic, and they ended up getting acquired by Oracle and became Oracle Retail. This was back in the day when they'd ship the data to you on tapes. So if you had a customer, to get access to their data, people say it's harder now to get access to customer data when you're dealing with enterprise customers but back then, we actually had to get tapes of data sent to us - a lot of point of sale data, retail sales data, and we were building predictive models for retail sales. Then I worked in biodefense at M.I.T., which is unfortunately a relevant topic today with the pandemic happening but we were working on these kinds of things. How do you detect and measure and prevent these kinds of things, both naturally occurring as well as, you know, you can imagine a terrorist attack with some kind of weaponized bio agent. So you did predictive modeling in the first dot com boom. Yeah You got tapes, how big is the data? What kind of models are you building? What's your tool set, satellite? That's a long time ago. . Thinking back, you know what's funny? We were actually using Python, right? So you're using Python and C++. And so at that time, almost nobody was using Python. Google, right. Google. That's right. So it was Google and then a handful of other startups around the late 90s, early 2000s but in the enterprise, Python was not really widely adopted yet. This was really close to the metal kind of work because you'd get the data; I don't remember the exact volume, but you can imagine a customer like WalMart, right? They have thousands of stores, thousands of products and each of those products has different sizes and colors and styles. 
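For readers unfamiliar with deconvolution, here is a toy numerical version of the problem described earlier in this answer: a decay signal gets smeared by the instrument's response, and a Wiener-style division in the frequency domain recovers an estimate of the original. The signal, blur kernel, and noise level are synthetic stand-ins, not real detector data.

```python
# Toy frequency-domain (Wiener) deconvolution: recover a decay signal that was
# smeared by the measurement apparatus. Signal, kernel, and noise are synthetic.
import numpy as np

n = 512
t = np.arange(n)
signal = np.exp(-t / 40.0)                       # idealized exponential decay
kernel = np.exp(-0.5 * ((t - 10) / 3.0) ** 2)    # instrument response (Gaussian blur)
kernel /= kernel.sum()

rng = np.random.default_rng(0)
measured = np.real(np.fft.ifft(np.fft.fft(signal) * np.fft.fft(kernel)))
measured += rng.normal(scale=1e-3, size=n)       # sensor noise

# Wiener deconvolution: divide in the frequency domain with a regularizer
# so that frequencies dominated by noise are not blown up.
H = np.fft.fft(kernel)
nsr = 1e-4                                       # assumed noise-to-signal power ratio
G = np.conj(H) / (np.abs(H) ** 2 + nsr)
recovered = np.real(np.fft.ifft(np.fft.fft(measured) * G))

print("max abs error:", np.max(np.abs(recovered - signal)))
```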
So really the SKU-store level was the granularity of the data, and then every transaction, similar to what something like Square might have now, where somebody buys a coffee and you get that transactional event. But at the time, the lag was really long. These are brick-and-mortar retailers, so they were mostly running on Oracle or DB2 or something like that. They would have their point-of-sale system, and then all that raw data would be aggregated up to something like: this shirt would be a SKU, and if you look at the sales of that SKU, you could see the sales nationally. Then typically what would happen is, there are markdowns. So after two weeks you lower the price, and then there's a sales bump depending on the elasticity of the item, the price elasticity. And so really the machine learning problem there was... the first few days, actually, I think we cut it pretty close, we had a week turnaround time, we would get the raw data and then we would run our models, some Python and C++ models, to do forecasting and optimization and model fitting. Then we would basically spit back price recommendations. So for every SKU, we're telling the retailer, ""hey, mark it down 10%, mark it down 15%, so that you'll optimize sales over the entire season."" And many times what ended up happening is it would take a few days to load the data. We'd have to get the data from tapes, we'd have to upload it (there could be an issue there). Then you'd have to get it into our models and run the models, which I think could take about a day or two at that time. And then at the end of the day, this is going to sound really bad, but QA at that point in time looked like... So the final product we would send was basically a CSV report that would go to the retailer, which they would then put into their point-of-sale system or their buying system, but there were some nights where we actually would just print out the CSVs and look at them, and then mark ""this looks wrong"" with a marker, and then go back and actually go in with Python and fix and change the models. And those were late nights. What was your title at that point? What did they call you? I think back then it was kind of like being a glorified grad student, but I think my title was Analyst or Data Analyst. Data scientist didn't come around until like 2009 or 2010. But then you actually were one of the original data scientists at LinkedIn in the early days. Yeah, yeah. I think everyone's gone on to do really awesome stuff. How did you come across LinkedIn that early? Yeah. So after I did the stuff at M.I.T., I went back and did some neural network grad work at M.I.T., and then, well, I wanted to move into consumer internet. It was actually like a year after I left Profit Logic that they got acquired, so I think I kind of got bit by the startup bug when I saw, ""hey, there is a light at the end of the tunnel, all these things can work."" And so I was eager to get into consumer internet. So I was at AOL Search in the D.C. area. They were based in Herndon, Virginia, and my goal was to just move out to the Bay Area. Actually, when I left AOL, that was the first time I signed up for LinkedIn, because, back then at least, when you would leave a company, you would get all these LinkedIn invites from your co-workers, and I hadn't created an account yet. This must have been back in 2008. I got a LinkedIn invite, signed up, and I liked the product. There was a group, you know, groups was a bigger feature back then. 
So I was on a lot of the early Hadoop groups and machine learning groups on LinkedIn, and connected with a lot of people. I think that's how we actually met, maybe through that. Mike Driscoll had a big data community back then; he was blogging, and he later went on to found Metamarkets. But yeah, basically I came out to the Bay Area and I was interviewing, talking to a bunch of different startups. LinkedIn was interesting to me because, I think a big reason was, Reid Hoffman, the founder, is really big on networks, and I was a big believer in the power of networks and connecting people to communities. This was before Twitter... or I think Twitter was out, but it was still pretty early. So there was LinkedIn, Twitter, Facebook. I was a big believer in the mission of LinkedIn specifically, because, and it's unfortunately relevant again now, I think there's nothing more meaningful you can do right now than get someone a job, and people working on things that are important to them and fulfilling, that feels like it matters, I think is really important. So it seemed like a great opportunity to leverage data. They had amassed this large dataset of all these people, their profiles, their connections, but they hadn't really flipped that switch yet, that machine learning or data science switch, to leverage it to have impact. That was just beginning when I got there. So what were the early projects? What were you doing there? It was interesting. A lot of the core elements that you see today were there at that point. So there was a profile, you could connect to people, there was an early version of 'People You May Know' - Jonathan Goldman was the first data scientist to build that - but back then, it was running on SQL. So it was like a big SQL job that would take a few days to run as a series of queries, and the network was much smaller; I think there were only maybe like 10 million members or something at that point. Now there's probably over 500 million. I don't know the latest number on LinkedIn, but at that point, the data was small enough that you could sort of make that work; but I think it would actually take over a week to run People You May Know at that point. And so some of the first products I worked on... the first one, actually, DJ Patil, who was running the team at the time... I was lucky enough, based on some of the stuff I'd done before, that he gave me a bit of latitude. And he said, ""just come up with a new product, come up with something that you think we should do and we'll pitch it to the board"". And so what I came up with was LinkedIn Skills. At this time, I was still basically an IC data scientist, product manager-type person. So I pitched this idea of, ""hey, skills seem like an obvious thing you should have as an element of somebody's profile, and there are all these other cool things we could do with it: we could use it in search and ad targeting, but we could also let people endorse each other for skills, things like that."" So we had these early notions that we could do stuff like that, but the first task is, how do you bootstrap something like that? So I'd say the first year was basically bootstrapping and building that from scratch. And so we put a team together. Jay Kreps, who's the co-founder of Confluent, was my first engineering partner on that project. And then Sam Shaw, who rewrote 'People You May Know' from SQL to MapReduce, was my second engineering partner and became my co-founder later at a startup, SkipFlag, that we did, but that was the first big project. 
How did you bootstrap it? Yeah. So actually, crowdsourcing came in. I think we may have used CrowdFlower. We definitely used Mechanical Turk. Basically it was a mix of different things; maybe in the show notes I can give you some references for papers on how we did it. The core prototype I actually built in, I think, a few weeks, and I really just slapped it together with duct tape. Again, some Python; SciPy and scikit-learn existed but were still pretty nascent at the time; and MapReduce. The stack back then looked like Hadoop, Pig, some Hive, but we mostly settled on Pig, which came out of Yahoo!, and it was basically a bunch of batch jobs. And the idea, the trick, was something I had kind of picked up when I was at AOL. At AOL, I was working on mining patterns from search query data and then crawling external websites and trying to actually understand the topics in those sites. You can imagine, if you're on TripAdvisor, what are the topics on TripAdvisor? What are people writing about in reviews? What are the locations that are in search queries? So I spent a lot of time working on NLP and information extraction. And that was basically the idea for bootstrapping skills. We had about 10 or 12 million English-language profiles, and basically it was a bit like word2vec, but pre-word2vec: extracting commonly co-occurring phrases from those profiles and then getting a bunch of candidates for named entities from the raw text. And from those candidates for named entities, again, similar to how a lot of people do named entities now, they use Wikipedia or Wikidata or things like that as a source of truth. If we could map those phrases, those surface forms, to an entity in Wikipedia, then I could normalize them. So if you say RoR or ruby on rails or rails, we could disambiguate those down to the same entity. And so it was a really primitive, in some sense, form of the things that we went on to do at our startup SkipFlag. But wait, wait. So you just used Wikipedia to pull in the skills? We would use that as, I guess I would say, a means of normalizing the things that people would say on their profile. So you could do basic named entity disambiguation. If somebody says a phrase like, let's say, angel... if you say angel, do you mean an angel investor or... There are people on LinkedIn who do psychic healing and can talk to angels, and so you want to be able to distinguish those two roles. So you used Wikipedia for your ontology. The ontology, yeah. So basically the knowledge graph was complicated. Not everything is in Wikipedia; the notability criterion for creating a Wikipedia entry is pretty high. So a lot of jargon, a lot of domain-specific stuff, is not in Wikipedia. So we would only use that as a... let's say Weights and Biases, people start putting that on their LinkedIn profile; that's an emerging topic which can be in the knowledge graph, but if we can link it to Wikipedia, that gives us a lot more evidence and data for tagging. Would you manually review new skills? Like, how would you know? Say Weights and Biases becomes a skill people want to put up, how would you create a new one? Initially it was a combination. So we would have an automated skill-discovery job that would detect emerging topics that probably were skills. This is where production machine learning gets really complicated. 
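A toy sketch of the surface-form normalization and disambiguation idea described above: variant phrases are mapped to a canonical entity via an alias table, and ambiguous phrases are resolved by overlap with context words, as one might derive from Wikipedia text. The alias table, context words, and profile snippets are invented for illustration; this is not LinkedIn's actual pipeline.

```python
# A toy sketch of surface-form normalization and disambiguation, in the spirit
# of the skills bootstrapping described above. The alias table, context words,
# and profile text are made up for illustration.
import re
from collections import Counter

# Canonical entity -> known surface forms (aliases).
ALIASES = {
    "Ruby on Rails": {"ruby on rails", "rails", "ror"},
    "Angel Investing": {"angel", "angel investing", "angel investor"},
    "Angel Healing": {"angel", "angel healing"},
}
# Words that tend to co-occur with each entity, e.g. drawn from Wikipedia text.
CONTEXT = {
    "Ruby on Rails": {"web", "ruby", "framework", "mvc"},
    "Angel Investing": {"startup", "seed", "funding", "investor"},
    "Angel Healing": {"energy", "spiritual", "psychic", "healing"},
}

def candidates(phrase):
    """Entities whose alias set contains the (lowercased) phrase."""
    p = phrase.lower()
    return [e for e, forms in ALIASES.items() if p in forms]

def disambiguate(phrase, profile_text):
    """Pick the candidate whose context words overlap most with the profile."""
    tokens = set(re.findall(r"[a-z]+", profile_text.lower()))
    scored = Counter({e: len(CONTEXT[e] & tokens) for e in candidates(phrase)})
    return scored.most_common(1)[0][0] if scored else None

print(disambiguate("angel", "Early-stage investor focused on seed funding for startups."))
# -> Angel Investing
print(disambiguate("RoR", "Built web apps with Ruby, an MVC framework."))
# -> Ruby on Rails
```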
If it's user-facing, I think I was maybe overly paranoid, but we really ended up not having a lot of issues with things like profanity and other things like offensive topics. And part of the reason was we had many layers of vetting. And so some of that was human-curated meaning we had humans come up with white lists and black lists and grey lists. So it might be OK for you to put on your profile. For example, alcoholism. If you are a psychiatrist, who helps deal with alcoholism and drinking disorders and things like that but you wouldn't want the machine learning algorithm to automatically suggest that to someone and be incorrect. Oh I see. That's an example of a grey list? That's a grey list. So we had multiple tiers of, ""Where is it appropriate to use this data?"" So we may be correct that that person in their profile said alcoholism, but we shouldn't suggest is as a skill necessarily. Right. Right. But I think the other thing that is interesting there is the use of crowdsourcing. So as a way in that first month when I was bootstrapping the system, I was able to get labeled data. I think the Wikipedia task was something like, we would show phrases in context and then ask them to label, ""hey, pick which Wikipedia entity is this phrase? Is this the correct one?"" And then that powers the machine learning training, which does it automatically. Got it. Cool. So you worked on it for like a year? Like, how long did it take you to get something that you could like to play into production? So I think that the prototype and with it, the front end and, the using SciPy and Hadoop, Pig, etcetera that took maybe about two to three months to get something reasonable. And then there is a bunch of design work and a bunch of engineering works. So the engineering, taking something that runs on your desktop or your laptop to do a prototype app that recommends skills for people, that's one thing. But at the time, the way that our production stack worked, we had something called Voldemort, which was an SQL datastore, and we had Hadoop jobs that would then push metadata, essentially. So let's say you extract suggested skills for all of those 10 million members, you would compute those suggested skills, push them up to an SQL Store, and then a recommender service which sends data to the frontend would have to pull from that data store and display it to the user. And then there's all the logic around, Did someone accept the recommendation? Did they decline it? Tracking, which would then go to Kafka eventually, all that machinery and all that engineering, that's where folks like Jay and Sam Shaw came in and design ed that. The other hard part, I would say is you always have these choices of, ""Did we do the thing to get it done quickly or did we do the thing to set us up so we can do 10 more products like this?"" And in those early days, we were at this transition point. We had done in the past a lot of things, like I mentioned, the three day SQL query. We wanted to do things in a bit more scalable way. So we bit the bullet and a lot of things like Kafka and other projects came out of those efforts to make it more scalable. So it was like two to three months to make the prototype, how long did it take you to get it out? I think it took about a year to get it out. It was phased, right? This is another good point; people should look for opportunities to do an MVP. So the way that we approached it was we actually did email first so it was much easier. 
Changing the front end of LinkedIn at that time was a big, heavy process and we had this framework, which is pretty hard to work with. It could take an intern about a month to learn how to commit a change and push it to the site and so email is much easier. So if we could push the data out of the loop to an email job, you can imagine something like an e-mail campaign on MailChimp - we weren't using Mailchimp, but something like that - you could push the recommendation. So I could send an email to you saying, ""Hey, Lukas, do you have these skills? Adam's your profile."" It's a much lighter weight way to do that. That, I think, took another few months to get all the pieces powered at the backend so that we could do that and then the work to get the frontend done and actually roll it out in a task was probably another few months. So all in, it probably took about six to seven months at that point in time to get this on out for all users on LinkedIn. Must have been satisfying when it was deployed to so many people. Yeah, I think it was actually the first strata. So I remember D.J. Patel had a keynote and he was going to announce it, but it wasn't quite ready to ship. And so I think I had a talk a couple of days later. And yeah, I think we announced it sometime around then. Right around the first strata in 2010. Cool. Hi. We'd love to take a moment to tell you guys about Weights and Biases. Weights and Biases is a tool that helps you track and visualize every detail of your machine learning models. We help you debug your machine learning models in real time, collaborate easily and advance the state-of-the-art in machine learning. You can integrate Weights and Biases into your models with just a few lines of code. With hyperparameter Sweeps, you can find the best set of hyperparameters for your models automatically. You can also track and compare how many GPU resources your models are using. With one line of code, you can visualize model predictions in form of images, videos, audio, plotted chats, molecular data, segmentation maps and 3D point clouds. You can save everything you need to reproduce your models days, weeks or even months after training. Finally, with Reports, you can make your models come alive. Reports are like blogposts in which your readers can interact with your model's metrics and predictions. Reports serve as a centralized repository of metrics, predictions, hyperparameters tried and accompanied notes. All of this together gives you a bird's eye view of your machine learning workflow. You can use Reports to share your model insights, keep your team on the same page and collaborate effectively, remotely. I'll leave a link in the show notes below to help you get started. And now let's get back to the episode. So you went on to start SkipFlag. What was it like going from a bigger company into your own startup? What was the experience like for you? It was interesting. I think the part that I left out in that transition was moving into management. So I had managed projects and small teams before, but as LinkedIn group; when I joined LinkedIn, there were about 300 employees. When I left, there was over, I think six thousand. In our data team, I think, obviously like the Facebook data team and LinkedIn, there are a number of data teams at that time that grew pretty fairly large and so like everybody else in a hypergrowth company, we had to learn a lot about how to run data teams. 
And one of the challenges that I found was that the tools we were using, enterprise tools, essentially, and workplace software, were still really dumb, right? So you're building all these cool, smart systems for Facebook and Google on the frontend, and LinkedIn, but the tools that all of us are using at tech companies are still pretty stupid. And so what I really wanted to do was apply some of that technology to those workplace problems, and more specifically, intelligent assistants. So moving to a startup, what was it like? I think that it's not as obvious when you're in these larger companies where you've hit scale. Everybody specializes. And you have... there was a great joke actually on Twitter last night. I forget who said this; she was at Twitter in the past, I think. Anyway, she said, who came up with Data Engineer? It should have been Data Lakers. [laughs] Data Lakers are data engineers who are doing all that hard work, creating those NoSQL stores and infrastructure. I think a lot of folks go off to launch a startup, and then the reality hits them that there's actually a lot to build. This was like 2015; even with Amazon Web Services or Google or whatever, there are still a lot of pieces and a lot of glue that you rely on at companies like Google and Facebook that is just not there, and you have to put it together yourself. So that was a big journey, building a lot of that. But I would say overall, it was extremely fun. When you're at a big company, there's a lot of impact that you can have, but at the same time, you spend a huge amount of time on coordination, and red tape, and getting everyone on the same page. And one of the advantages of a startup, obviously, is you can move a lot faster. You make mistakes, but they're your mistakes. That was really exciting. I've never asked you this. I'm actually genuinely curious about this one. You had an enterprise tool that helped you organize information at SkipFlag. I've never had this problem as an entrepreneur, just because of the space I've gone into, but how did you prototype it? Did you get someone to give you their Slack logs and do it for them? Or how do you even build ML models without the data? Yeah. So that's the chicken-and-egg problem. I'm a little OCD when it comes to data and datasets. Back in maybe 2007 or 2008, I wrote a blog post on some datasets available on the web, and it was in those early days... so Aaron Swartz was another person who was really big on making datasets available and open. And he famously got in trouble for scraping academic journals. So he was working on some projects in this area. I'd been collecting a lot of datasets, a lot of public datasets, for years: using the Twitter API, for example, I had been doing firehose crawls for years; Wikipedia; I mean, I'm an adviser to Common Crawl, so that was used to create GloVe and a lot of these other NLP datasets that we all enjoy today. Is the Enron corpus still relevant? Is that still around? I remember working on that. Funnily enough, it really is. We did some work... I don't want to give away too much secret sauce or details, but we were working with one enterprise customer. Let me say a little more about the tool. So how did we get going? So what SkipFlag was doing: we were a knowledge base that would build itself out of your enterprise communication. We started actually with Slack, because one of our first investors was Accel. And so I actually did an EIR 
at Accel and was hanging out there at the time as we put the company together. And they were investors in Slack. And so I was using Slack a little bit before it launched. I think that was maybe like 2014, 2015, and I really liked it. I'm pretty picky when it comes to workplace tools, and I enjoyed it. Obviously, they've been massively successful; tons of people use Slack. But it felt like an opportunity for a dataset, since nobody was really using that dataset yet. It posed some unique challenges, similar to Twitter, in that there had been email startups, many email startups, before, and there were startups working on documents, but Slack was interesting because to me it felt closer to Twitter: short-form messaging data, very hard to do the kinds of things that we were working on, knowledge extraction and entity disambiguation. But there's a lot of data and it's accessible. So they had a pretty good API in the early days in terms of actually pulling public-channel Slack data. So the initial way we bootstrapped was actually using Slack, and one of the hard parts is if you were to build the whole product... I think we still have a video online; we did a paper at KDD and they have you do a short video describing the paper. So we did a paper on entity extraction on noisy text, and in the video we have a short product demo, so we could put that in the show notes so people can check that out. But before we got to that full-blown product, which looked a bit like Notion or other modern wiki-like products, except it had this AI infused in it that could auto-organize all your docs and answer questions about things you could upload. Eventually, ultimately, after we were acquired by Workday, one of the things we worked on was: you could give it a PDF of your workplace HR policies, and it could do fact extraction across all that and then automatically answer questions, which is pretty cool, for an HR person to have this thing automatically answer those questions based on just a document. But before you get to that, how do you do this with little data? So basically, I went around to my friends, got about 100 or so startups that I knew in as beta users, and said, ""Hey, Slack is confusing. It's noisy. It's hard to sift through. What if we gave you a smart e-mail digest and we just summarized what's going on in your Slack team so that you can keep up to date with what's happening and see interesting stuff? And oh, it'll have news articles recommended based on what you're talking about and things like that."" So we did that. We did that prototype. And to do that, we had other auxiliary datasets and we could do a bit of transfer learning and things like that. So we had the Wikipedia corpus, Common Crawl, which is a fairly large dataset for NLP, I think it's a few terabytes of web crawl data. So we were able to train and bootstrap on open datasets and web crawl data and Twitter data, in combination then with customer data, to do something like smart summarization and extraction. You were going to mention the Enron corpus. Did you also use that? Oh yeah, that actually came much later. So we started with Slack. We did the e-mail digest. And then in parallel, we were building out the product that became SkipFlag. One of the things that we found was that Slack is great, but at the time, this is like 2016, 2017, larger companies were still not all in on Slack; and I think they probably still are not all in on Slack. 
You were going to mention the Enron corpus. Did you also use that? Oh yeah, that actually came much later. So we started with Slack. We did the e-mail digest. And then in parallel, we were building out the product that became SkipFlag. One of the things that we found was that Slack is great, but at the time, this is like 2016, 2017, larger companies were still not all in on Slack, and I think they probably still are not all in on Slack. Teams is getting a lot of adoption, obviously, things like that. But the world still runs on email, right? Email is a big deal, and we had worked with email before. My co-founder, Sam Shah, actually ran email relevance at LinkedIn, so we're pretty familiar with working with email, and with the Enron corpus. I actually took a class with Leslie Kaelbling, I think, back at MIT. She bought the corpus from Enron when they went bankrupt or whatever, and that's how it became an open dataset. She curated it and put it out there. Anyway, long story longer, as we got into email... that's one of the few public email datasets out there. So when we would want to show a customer how well this could work, that was the dataset that we would benchmark on. And a lot of academic papers still use that as a benchmark. It's funny, I remember working on it in grad school and I just kind of feel sorry for all the employees that are in it, even though it's released. It is. I think it's a good lesson for me, because now I'm really religious about keeping work stuff at work and personal stuff personal. And nobody is more careful about that than a data scientist or engineer, for sure. We were really careful from the beginning and really rigorous. We had PII scrubbing and all kinds of stuff in our machine learning pipeline from day one. How did you do automatic PII scrubbing? So there's a bunch of techniques out there. Obviously nothing is foolproof, but this actually goes back to that first week at AOL. I had just started, so I was almost in quarantine; I hadn't been involved with the release of the search dataset. So for whatever reason, I was tasked with going in and putting in place a bunch of the PII protection. You can imagine in search query logs what the common things are that you... Wait, so you put in place what PII was relevant for the AOL systems then? I didn't put it fully in place. I was a data scientist, I wasn't really doing the production engineering, but I did put together the scrubbing layer, which was things like, ""OK, how do you detect FedEx IDs? How do you detect a Social Security number?"" That kind of stuff. But this was years ago. If you are going to do it now, Microsoft actually has an open source project for this, and there's a whole bunch of others. There are about a half dozen open source efforts to do this kind of thing now.
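The Microsoft open source project mentioned here is likely Presidio, their PII detection and anonymization toolkit. As a toy illustration of the kind of regex-based scrubbing layer described above (not the actual AOL or SkipFlag implementation; the patterns and placeholders are assumptions), a first pass might look like this:

```python
# A toy regex-based PII scrubber, illustrative only. Real systems (e.g.
# Microsoft's open-source Presidio) layer NER models, checksums, and context
# rules on top of patterns like these, because, as noted above, nothing here
# is foolproof.
import re

PII_PATTERNS = {
    # U.S. Social Security numbers, e.g. 123-45-6789
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Very rough email matcher
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    # Simple U.S. phone number shapes, e.g. (415) 555-0123 or 415-555-0123
    "PHONE": re.compile(r"\b(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
    # 16-digit card-like numbers (no checksum validation here)
    "CARD": re.compile(r"\b(?:\d{4}[-\s]?){3}\d{4}\b"),
}

def scrub(text: str) -> str:
    """Replace anything that matches a PII pattern with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

if __name__ == "__main__":
    query = "my ssn is 123-45-6789, email me at jane.doe@example.com"
    print(scrub(query))  # -> my ssn is <SSN>, email me at <EMAIL>
```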
So tell people, what's the AOL query story? I mean, you should really have Ron on, I don't know if he's going to talk about it. He was later chief scientist at Twitter, and at the time he was driving search at AOL. It was a strange situation, because technically nothing was out of the norm, in that... you mentioned the Enron dataset. At the time in academic research, in the world of search, there were a few datasets. There were the Excite logs. MSN had a dataset of search query logs out there. Maybe Lycos had a dataset out there. So there was kind of an accepted format; there were two or three datasets in the academic world that were search query logs. And AOL basically released one in the same format. So I think they didn't really expect anything like this to happen, but what ended up happening was Reddit... this was in the early days of Reddit too, right? I remember I was actually looking at Reddit that weekend, when I had just started the job, and on the new page where emerging stories pop up, I saw this story: AOL releases search logs. I don't know if the user who posted it was the first person to see it or if it was a reporter, but someone on Reddit said, ""Hey, check this out. The search queries are here."" And then what happened was a whole bunch of people started putting up sites where they put a web app in front of the search query logs, and you could go in and explore crazy queries. And query session logs are deeply private, sensitive things, even if you remove the user ID. This is what the world discovered, basically. When that happened, it was a wakeup call. I think a lot of people in technology already knew that this stuff could be sensitive, but for a lot of the world it was a wakeup call that what I type into my browser goes somewhere and it can have an impact. If you think about the advice back in the early 2000s dealing with Google, the general advice out there was to Google your name every few weeks, to make sure that there isn't something bad on the Internet about you, reputation management. But what that means is that in those anonymized search query logs, one of the more common things that people were Googling was actually their own name. So that made it fairly easy to triangulate individuals in many cases. Ultimately, the search dataset was taken down, and then I think we entered into this period where it was much more difficult for academia. Like you mentioned with the Enron dataset, I don't think something like that would happen today. You wouldn't have an Enron dataset. You wouldn't have the AOL search logs. It became much more locked down. I think the Netflix Prize dataset was one of the last big ones. Yeah. And I hear there were complaints; I think Netflix didn't actually do a follow-up to their competition because of the privacy issues you brought up. Yeah. It was a good run, and that kind of correlated with the rebirth of deep learning. I remember I was actually working on it on the side when it was going on. I think I was at AOL at the time, working on the Netflix Prize, and I was in all the forums. This was before Kaggle, but it was basically one of the first... there were the data mining competitions, and then there was the Netflix Prize, and then there was Kaggle. And it was really interesting to see the progress, because deep learning kind of came out of left field and ended up working really well, and then the ensemble techniques. But I think that period was the catalyst for a lot of what day-to-day folks like you and I deal with in machine learning. And this massive surge of progress, I think, is largely because of these benchmarks, and then things like ImageNet. So anyway, I know we're on a tangent here, but I think that was a really exciting period, and we're seeing the compounding effect of that now, where the technologies that people have at their disposal are amazing compared to what we had 10 years ago. Totally. And it's so powerful. I really want to make sure I get this in before we run out of time. You've been doing some consulting lately for different companies. What are you seeing out there? I'm really curious. What kinds of stuff are people doing and what kind of technology are they using at this point? Is Python and C++ still the standard? What's going on? I haven't seen C++ in a while, actually. If you're using devices and things, maybe. But it's interesting.
After the acquisition of the startup, I took a break, started doing some angel investing, and I get people paying me periodically to come in and help with strategy, consulting, and running or rebooting data orgs. And it's interesting: deep learning, go back three or four years, was seen as a risky proposition unless you were a small startup. Bigger companies were not, I think, doing it as much in like 2014, 2015, but now it does seem like everybody wants to be using TensorFlow, PyTorch, things like that. I think also, obviously, the cloud providers have become a big player. So a lot of people are using SageMaker. They might be using Google Cloud Platform... Do you have stuff you recommend when you come in? Do you have an opinion as to what people should be using? I'd say I am still somewhat agnostic. Generally I tend to use Amazon myself, but I'm open to using other tools and other platforms. For example, I've got a camera here, this is pretty cool, I've got this Azure smart camera, so I'm going to play around with that. But that's the cool thing: if you're using TensorFlow, all the big players use all the same open source stuff, so I can run TensorFlow on Microsoft or Google or whatever. Then I think it really depends on the problem and it depends on your company stack. I'd say if you're already all in on Google, then using a lot of the Google tooling can make sense. So that's where I think I'm not religious about one platform or another. I think they're all converging to some degree. But I have a lot more experience with the Amazon stack, probably like most people. In terms of what I see at these companies, I think what ends up happening is actually similar to how things were a decade ago. When we changed the branding from research scientist or machine learning scientist to data scientist, a lot of that was because you needed people who could put stuff into production, and that production machine learning engineering and data engineering is different from an academic who can write a paper. And so I think you do have this challenge when it comes to hiring and shipping products. Managing research scientists is something that's difficult for anybody, but it's especially difficult if that's not your area of expertise. If you're an enterprise company, if you're building workplace software, or even if you're a consumer company, typically they'll have a VP of Engineering or something try to manage those teams. But if they don't have the background, it can be really hard, because planning is hard. Prioritizing those projects, knowing what's likely to work if somebody hasn't done it before; and then you hire some people out of school, or people who've worked on Kaggle competitions, and there are a lot of pieces that they're missing to actually ship and execute. Do you think there's something different about data stuff than other things? Because if I'm a VP of Engineering, then I can't be an expert on, you know, DevOps and architecture and all these things, so I have to rely on folks. Is there something that makes data science particularly challenging in this way? So, in all these other areas, the tools and technologies definitely change and people have to keep up with them. I think one of the challenges with machine learning is partly because of the people who work on it and partly just because of the nature of the field and how rapidly it's changing.
They're always trying the latest thing, right? And I think that's very hard to manage, whether it's even just something as simple as library dependencies or methodology. Things are changing rapidly, and that is at the root of it. I think the other thing is, if you're doing DevOps, DevOps at Company A probably looks a lot like DevOps at Company B, right? You choose your stack, you choose your tooling, and then you live with the consequences of those decisions. But for machine learning, almost every problem is different. It's like that saying, every family is dysfunctional in its own unique way. I think the same thing is true of machine learning teams and projects. If you're trying to predict financial fraud, and before that you were working on detecting porn in user profile images, those are two vastly different problems. So SREs may look like SREs across companies, but machine learning problems are not all identical. Interesting. Let's stick with this article you wrote, ""What You Need to Know About Product Management for AI."" It seems like the best place to start with this. It's a question I have about a lot of things written about AI, which is: what makes product management for AI different than product management in general? So I think the main difference between product management for AI and product management for traditional software projects is that machine learning software is inherently probabilistic, whereas classic software development is more deterministic. Ideally with software, you have this rich methodology around unit tests and functional tests, integration tests and builds, and as you're developing the software you expect it to always behave the same way if you've instrumented the right tests. That creates a very clear, comfortable development process, and engineering leaders and architects and product managers are comfortable with that, right? Most product managers like to run projects that are predictable. They like to be able to commit to deadlines, to work with partner teams and customers and be able to commit to a date. If you are mostly building things that are clear and understandable, or things that you've built before, you can come up with good estimates if you have enough experience. I think with machine learning, the uncertainty comes from a bunch of places, but it goes all the way down to the individual algorithms. If you're training a model, there's some amount of randomization, and different random seeds can lead to different results, all the way up to your approach. Your approach may be different. Every problem is a little bit different for machine learning; otherwise it would be essentially solved. So there's always something different about it.
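As a small demonstration of the probabilistic point above (same code, same data, only the random seed changes between runs, and the metric still moves), here is a hedged sketch using scikit-learn on synthetic data; the model, sizes, and numbers are assumptions made purely for illustration.

```python
# Training the *same* model on the *same* data with different random seeds
# gives different results, a small demo of why ML work is probabilistic in
# a way most software is not. Synthetic data; numbers are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

for seed in range(5):
    # Only the weight-initialization / shuffling seed changes between runs.
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=seed)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"seed={seed}  test accuracy={acc:.3f}")
```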
Maybe it's a slightly different application, a different dataset that you're using. Let's say it's a movie recommender. If you're going to recommend videos on TikTok, that on the surface seems similar to recommending Netflix movies, but if you peel back the onion, it's really pretty different. They're short videos. There's not a lot of context. They're very fresh, very new every day, and they're user-generated, right? So you don't know what is going to be in that video, and there's not a lot of dialogue. It's more visual, versus Netflix, which has a curated catalog of blockbuster movies or self-produced movies where everything is very carefully controlled. So on the surface those both look like recommender problems, but a machine learning person would realize, ""OK, there's a huge set of different things it would have to do for TikTok than it would have to do for Netflix."" And that's just scratching the surface. But why are these different? Really, the planning process is often very different because it's very data dependent and very application dependent, versus, you know, a user sign-up flow, which looks very similar across many different software applications. So what does the planning process look like in the face of this amount of uncertainty? I think there's two things: there's what the planning process maybe should look like, and then realistically what it looks like in most companies. I'd say in most companies, the pattern I've seen is people just do what they know. They continue to try to plan these like traditional software products. I think the better teams are aware of some of these issues, and they treat the uncertainty from day one and build it into their planning. Effectively, most machine learning projects are much closer to R&D than they are to something very clear and easy to execute on. So I think the best way I've seen teams plan involves first starting with the core problems that matter to your business. One of the problems could just be that the set of machine learning projects you're working on may not be the right ones. Typically, companies have some kind of product planning or roadmap-building process. They may do this quarterly, they may do this annually, where they come up with a set of funded projects that they're going to staff, that they're going to resource, and that they're going to execute on. And so I think fundamentally, you need to have a clear set of projects that align with your company strategy. So let's say you're a consumer app and growth is important. One of your key metrics may be daily active users, and time on site, and sign-ups, things like that. So clear business metrics where, if your machine learning project has an impact, you can see the number change, right? What you don't want to have happen is to spend six months to a year working on a machine learning project and then, at the end of it, you can't see a material impact in any numbers. And this does happen a lot. A lot of the time, I'd say, especially in enterprise, machine learning is seen more as a feature, like an interesting checkbox to have, but it's not necessarily tied to a clear business outcome. Why do you think that happens? I mean, a lot of folks we've talked to have talked about it, but it sort of seems like connecting a project to a business outcome is a best practice for any kind of project. I do hear this over and over, so there must be something I'm missing. What do you think it is? I think some of it is just lack of familiarity with the domain. In some ways, unfortunately, it's like blockchain. People hear a buzzword, they hear blockchain, and they say, ""OK, we need to have a blockchain story."" So I think that happens when these things start top down. The company may say, ""OK, our board is pushing us to have a blockchain strategy,"" and then they get some consultants in, and maybe they have internal execs come up with something. And I think when these things tend to be pushed top down... It can be good to have executive support, don't get me wrong, but I think you do need the bottom-up expertise and experience to connect those dots.
And that's really where product management shines. So I think if you have a good product manager who's thinking about this and who's very numbers driven, that can help. And I do think that tends to happen more in these well-instrumented companies, versus enterprises, which are usually more sales driven. What about specifically addressing the uncertainty? Say you have a thing that's connected to a business outcome, but, like you said earlier, with ML it's hard enough to even know how good of a system you can build, so how do you plan around that level of uncertainty? So there are a number of different strategies. One of the ones that I'm really honing in on lately is an old strategy. It's what DARPA used for self-driving cars. It's what Netflix used for the Netflix Prize, and what all the data mining competitions used for years, which is benchmarks. So I think if you have a clear... you could go back to video recommendation again. Netflix created a benchmark dataset, held back some test data, released training data for people to train models, and then had a clear set of evaluation metrics. And they said, here's the current state of the art, or what is in production right now, and we're going to pay $1 million to the first team to get a 10% improvement. I think it was 10%, right? Yeah, 10%. Yeah. The nice thing, and this will appeal to a lot of product managers, is that they like clear objectives and goals that people can rally around, that your team can rally around. So I found that really effective. If you can't construct that benchmark... and it's surprising the number of teams that actually skip ahead, they just don't even bother with that. They may have some business metrics they're measuring, and they may have model metrics that they're using internally, but they're not really connecting those in a clear way. Like, for example, it's like testing in production. They may do something where they have an A/B test and they say, ""Hey, when we rolled out this new model code, in our A/B test we saw a 5% lift. So it's better than the old one. Good job. Work on the next model."" And if you're doing that, it's very easy to fool yourself and it's very hard to debug. So that's where the uncertainty creeps in: you don't really know where you stand. You don't know, ""Was there something else happening in the data during that time period that affected the model?"", ""If we reran that same model on new data, would we get the same result?"" And so this is one place where I think experiment management comes in, where you could frame this. It's really critical that you build those benchmarks, that you hold out some dataset, some stream of traffic, for example, and that you keep running multiple models on it so that you can ensure you haven't regressed in terms of performance.
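A minimal sketch of the fixed-benchmark idea described above might look like the following: freeze a held-out evaluation set once, score every model (including the one currently in production) against it with the same metric, and refuse to ship anything that regresses. The data is synthetic and the model names are made up for illustration; this is not any particular team's harness.

```python
# Sketch of a fixed benchmark: one frozen holdout, one agreed metric,
# every candidate compared against the current production model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Freeze the benchmark split once; never touch it for training or tuning.
X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_bench, y_train, y_bench = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def benchmark_auc(model) -> float:
    """Single agreed-upon evaluation metric on the frozen benchmark set."""
    scores = model.predict_proba(X_bench)[:, 1]
    return roc_auc_score(y_bench, scores)

production = LogisticRegression(max_iter=1000).fit(X_train, y_train)
candidate = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

prod_auc, cand_auc = benchmark_auc(production), benchmark_auc(candidate)
print(f"production AUC={prod_auc:.3f}  candidate AUC={cand_auc:.3f}")

# The ship / no-ship decision is made against the benchmark, not against a
# one-off A/B result that may be confounded by whatever else changed.
if cand_auc < prod_auc:
    print("Regression on the benchmark: do not ship the candidate.")
```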
One of the things that you said to me in a private conversation was that standups with ML teams feel a little different than standups with an engineering team, because with engineering it's like, ""Hey, I wrote this feature,"" but with an ML team it's a little harder. Do you have any thoughts around that? Yeah. I think I've mentioned this on Twitter a few times as well, and it's interesting to see the discussion that people have around this. With a lot of data scientists, it resonates. Some of it may just be understanding the nature of the work that scientists and machine learning researchers are doing, and then how to translate that into that kind of standup format. I think you see the clash of cultures immediately in a standup, because often, say you're doing agile development or scrum or something, you have very clear chunks of work that may take one to two days in traditional software development. If all the people working on other parts of the product launch have work that's easily chunked in that way, then they can close the Jira tickets more easily. They say, oh yeah, I implemented that API that will talk to the e-mail system and we're all set, and it's in testing. That's very different from when they get to the data scientist in the standup, and they say, well, I'm still training the model and something is not working, but I'm not quite sure what it is. I'm going to look into the initialization parameters and maybe try to optimize that, and I'll report back next time. And then repeat. Right. And it's always something. The model isn't working until it's working, unfortunately. And so I think that can create some stress. Maybe it's just me, but I feel that stress. In terms of strategies to deal with that for product managers, I think it's at least good to call it out. If you don't talk about it, it can start to seem strange. So I think it's at least worth calling out. And this gets to the point of organizational support for these ML projects. If you listen to the chatter, there are all kinds of apps for back channels now. There's Slack, and there are other things like Blind where people talk about their companies, and especially in an environment like now where there's economic uncertainty and pressure, I think increasingly you're going to have this chatter, which is already there, around, ""Hey, what are those ML people actually doing? They're getting these big paychecks, where's the beef? What are they delivering?"" I think this is critically important. And with standups, I think it would be good to have more clarity around what machine learning folks should report in standups, and to make it clear that the progress meter is going to be a little different. It may be research results, like here's the objective we have this week, here are the things that we want to put in place, and we may have accomplished them even if the results are not there. I think if you say, ""Hey, we're gonna improve by 5% this week, and that's our goal for the standup,"" that can be very hard, because you may not hit it. Chris Albon, also on this show, talked a lot about creating a sense of emotional security for his ML team, and I think a big part of that for him was not focusing people too much on these sorts of external 5%-increase goals. I have to say he made some sense, but I'm not sure I was fully convinced myself. I mean, as someone running a company under pressure, I think the way that I run my teams is pretty external-metrics focused. Yeah. But I think there are downsides to it for sure. And it's very hard to know what a reasonable goal is. So yeah, I go back and forth. So I think it varies depending on the stage of the project and of the company. When they're just starting with machine learning, I think the hard reality is it's gonna be hard to get to that number when you don't even have your ML infrastructure in place. So in the early stages, before you actually have a working product in production, unfortunately it's gonna be really hard to be metrics driven.
So you may have decoupled things, you may have some set of people working on sample data, training a model, and maybe quickly you can get to a benchmark. What I would suggest people do is... I kind of agree with you and I agree with Chris, but I think you have to encapsulate it in different ways. If no part of your team is numbers driven, then you're in trouble. So what I would say is, let's say you're starting a new project, say it's fraud detection in accounting or something, and you're going to roll out that model. One of the things, when you get back to project managing and planning and how you run these projects, is that as quickly as possible you need to get to a benchmark dataset. I remember Leslie Kaelbling, I think we talked about her before, she's a professor at MIT; I took a machine learning course she taught years ago. There was a project in the course and people had to pick projects, so in some ways it's similar to picking a project at your company. And she said one of the most important things is, if you don't have the dataset in hand now, pick a different problem, because you're going to spend the whole semester just gathering the dataset. Now, I wouldn't necessarily give that exact advice to a company. If you don't have a dataset, maybe you do need to gather it. But if you can have that dataset ready to go and get people working on it and benchmarking right away, then you can get on this nice track where you can track progress. We had a whiteboard with a goal, you know, ""Hey, in two weeks, this is the number we want to hit."" Maybe it's an AUC of X, and everybody would just keep their eye on that number, and we'd know what we were shooting for. And it would be a roughly two-week horizon. I think, yes, two to three weeks. Now, that's once you have a working model. Once you have a baseline model... one of the most important things you can do is just start with an MVP, get something basic in place, so that you can get on that AUC-improvement track. And once you can do that, then you can create this momentum where the team feels like there's progress. And I think for project planning, it then becomes more clear. So once something has shipped, and assuming it is tied to a business metric where you improve AUC and then you see revenue increase or users increase or something, then it becomes very clear what impact your team is having. And I think this solves that backchannel chatter issue as well. Part of the PM's job here is to keep people moving towards that objective, but then also to communicate it to the rest of the company. So good weekly status emails, company-wide update emails that go out to everybody, making it very clear that, ""Hey, we have these model improvements, which added X or Y to our bottom-line metrics."" That makes total sense. Yeah. So anyway, I don't know exactly what Chris's comments were, but it sounds like shielding your team so they don't feel overly stressed by metrics can make sense, but I think that's more important in deep R&D. So, for example, Google would have 20% projects, right? At LinkedIn, we did this as well, except it's not as simple as, ""OK, one day a week of someone's time."" Often what would happen is entire sets of people would be 100% on a 20% project. In ML, that's really what you need to do sometimes, and so you shield those people. Definitely.
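As a hedged sketch of the baseline-MVP-plus-whiteboard-number idea above (synthetic data, arbitrary target, not any team's actual process), the first step is just to put a trivial baseline and a simple model on the same benchmark so there is a concrete AUC to beat:

```python
# Sketch of "get a baseline on the board first": score a trivial model and a
# simple MVP on the same holdout, then set the next whiteboard target.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def auc(model):
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Step 1: trivial baseline; this is the first number on the whiteboard.
baseline = DummyClassifier(strategy="prior").fit(X_train, y_train)
# Step 2: simplest real model; the two-week goal is just to beat the baseline.
mvp = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"baseline AUC={auc(baseline):.3f}  MVP AUC={auc(mvp):.3f}")
target = auc(mvp) + 0.02  # e.g. "in two weeks we want +0.02 AUC"
print(f"next whiteboard target: {target:.3f}")
```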
I remember years ago, a CEO told me that he basically gave out bonuses to his ML team based on the lift that they got on the projects they were working on. And on one hand, that seems like a very fair management strategy, kind of pushing the decision making down to the folks doing the work. On the other hand, I've worked on many projects and I've been really surprised by not making any progress, or not making the progress that I wanted. So I think that could be a more stressful environment for sure, right? Yeah. I think in the article I mentioned something related to this. I'm not sure if I mentioned OKRs explicitly, but I do talk about setting goals and objectives and how that can go wrong. So I think this is a great example. I am a believer in that general framework; I think what becomes difficult... So for people who aren't familiar with product management measurement and OKRs, what typically happens is that at the beginning of a quarter, everybody signs up for an OKR, and you're not supposed to sandbag, you're not supposed to set the bar low. You're supposed to have a reasoned, ambitious goal. You typically have something like three OKRs per quarter, and it might be increase user signups by 20%, increase revenue per user by, you know, 10%, something like that, and they may be a little more granular, right? Especially in a larger company, they become more granular. It may be something like increase search relevance, as measured by F1 or whatever, by 30%. In any case, those metrics are important, and I think the hard part, among the product leaders, is that you have to be very careful about how much latitude you give on pure ML metrics versus business metrics, because it's very easy for ML teams to drift. That's how you get this bubble where everybody is just doing R&D. And then what ends up happening is a lot of the business leaders and PMs see those OKRs and they just shrug and say, I don't understand how that relates to the business. Right. Right. Rewarding people for OKRs is pretty standard. And where that can go wrong, I think the YouTube example is one of the best-known ones, where by all OKR metrics, YouTube has been succeeding wildly over the last five or six years, probably. But the downside is, when you manage to a single number, PMs become machines. It's like the parable of AI and paperclips, where you build an amazing paperclip optimizer with AI and then it destroys the world to make as many paperclips as it can. PMs are the same way: you give them a metric, they're gonna hit it, but there may be a lot of collateral damage. In the case of YouTube, there is a lot of misinformation and conspiracy theories, because they lead to clicks. So by the measure of engagement on YouTube, they're doing fantastic. But at what cost? Right. Right. I guess that's a good segue to another topic that you talked about in your article that might be interesting to discuss here, which is building infrastructure to make your AI or machine learning actually scale. What kinds of recommendations do you have there? So that's a deep, deep topic. I think there's a spectrum of companies. Companies like your Googles and Facebooks are already deep into building ML and they have all the frameworks, so it's hard to say... the advice that makes sense for them may not make sense for other companies, I guess, is one key thing to be aware of. And so, realistically, I would break out a few different types of companies. For your hypergrowth technology companies that are more consumer facing, or enterprise SaaS apps that are in the cloud from day one, those are your modern-technology-stack companies.
In many cases, those companies have good tracking in place or are using some modern frameworks and tools. They have Kafka, they have things like that, and they probably have good data ETL. Now, it varies quite a bit. Some companies just move so fast that things are duct-taped together, right? Even for successful startups. But it tends to be the case that those companies at least have a lot of the raw pieces in place, so that when they do get to a stage where they want to use machine learning to make their products better, there's some amount of work, but it's maybe a year or 18 months of work to get things pretty solid. I think the bigger challenge is the more legacy companies or enterprise companies, where organizationally the very large organizations may have different data systems. It's typically hard in those companies to get access to data. People may even hold on to data and not want to give it up without 10 meetings. And even when you get the data, ""getting the data"" means different things, right? Someone giving you a dump of data, which is static, is very different from, ""Hey, we want to do this in production. We want to do fraud detection. We need a Kafka feed. We need all this infra."" And so for those companies, my recommendation has been: don't try to reinvent the wheel. There are 20 or 30 companies doing this, and you see what Uber did with Michelangelo, what Salesforce is doing with Einstein. Everybody is trying to build their own ML platform internally. That typically happens when the top-down guidance says, ""Hey, we need an AI strategy."" Somewhere in those early discussions, they jump to the conclusion, ""Oh, well, first we need to build our own AI framework, and let's give that project a name."" This happens a lot in software development at big companies. It's project-name-driven development: they come up with a name, that'll be our infrastructure ETL system, let's go build that. And that might take two years. I don't know, is this actually what you're seeing in big customers? Yeah. That's funny. You put my quote in the article, which was probably the tweet that I was made fun of the most, which was that big companies shouldn't build their own ML tools, and it's like, ""Okay, I am selling an ML tool."" So I am incredibly biased, but I would say from my experience, it's just baffling how you go into enterprise companies in particular and they do this, and they build so much infrastructure that they could get cheaply or for free. And it's actually funny, because you go into Google or Facebook, which have been around longer, and they actually pull in a lot more open source infrastructure than companies less far along. So yeah, it's always surprising, and I think part of it is actually that folks want to... sort of, building ML infrastructure inside a company is a little bit of a career development path, or... Yeah, I totally agree. I think a big part of it is incentives. If you think about it, and this isn't just Silicon Valley, I think it's all over: PMs might be rewarded for OKRs and hitting product metrics, but a lot of engineers are rewarded for releasing a new open source framework, for giving a talk about some new infrastructure piece they built that everybody in the company is using. And so I think engineering and product leaders really need to think about what they're rewarding. One way to think about it is maybe to reward leverage as well.
So if I was in a situation where somebody made the choice to use an open source system, or even to use a vendor, and they delivered ahead of schedule and everything's working, you need to find a way to reward that as well, not just rewarding, ""Hey, I did an 18-month sprint to build something that is not as good as what I could get off the shelf."" Yeah, totally. I think the hard part, and you also deal with this when you roll out to customers, is the engineers on the ground. And I think there are legitimate reservations: sometimes using a third-party thing can make people uncomfortable. They don't feel like they can adapt it or change it to fit all their needs, and so that's where I think a lot of the frameworks need to be really responsive to what a customer needs and how flexible they are. Because I think we've all been in a situation where you use some third-party thing and it's too rigid, and eventually it causes a lot of headaches. And it does seem like a lot of the tools and infrastructure that come out are built by engineers coming out of Facebook and Uber and others, and they might not actually realize the different needs that other companies have. Yeah. So I think when it comes to the infrastructure side, that's another common pattern. I worked with Jay Kreps, who is the CEO of Confluent. He was originally my engineering partner back at LinkedIn when we were building some of the data products there. We had built a number of these things, and he eventually went on to work on a lot of the infrastructure pieces and focus on that. What eventually became Kafka and other open source projects grew out of, ""Hey, we've built this four or five times now. I think we should abstract this out."" That's a very different approach than saying, ""Hey, we've never done this, but let's go design what we think the right thing is and then build this abstract platform."" So I agree with him; I tend to think frameworks that grew out of real experience tend to be better. So if you're selecting something as a product manager or engineering leader, make sure you know the origin of the framework. And ideally, if you're at one of these companies building this stuff, one of the best things you can do is just see more problems and map what you're building to those customer problems. We always end with two questions, and I'm wondering how you're going to answer these. I think they relate to a lot of the stuff we talked about, but here's my first one: what do you think is the topic in data science or machine learning that people don't talk about enough, the underrated thing that in your experience matters more than the time people spend thinking about it? I think it's actually constructing good benchmarks. If I were to look at the teams we talked about that are struggling or having trouble, I would say nine times out of ten they haven't done the hard work to construct a crisp, clean, precise benchmark and an evaluation of how well their model is doing. What often happens is people have these notions, ""Hey, I'm going to build a recommender. How do I build a recommender?"" and ""Oh yeah, we'll get this data,"" and they actually just start building the model. And then after the fact, maybe they label some sample data and say, ""Oh, this is my gold standard."" And that's maybe the better case, I'd say; a lot of the time people don't even do that.
They just launch the thing, and then it becomes very hard. Or they may use proxy metrics like, did it increase lift? Did it increase CTR? And maybe an A/B test, and A/B testing is not a benchmark, basically, I would say. So build a benchmark, be rigorous about it, and if at all possible... because the other thing that happens is when things aren't working. Say you're six months in and your model isn't working and you don't know why: you need that benchmark so that you can debug what's going on. And if you don't have it, you're going to flail. It's funny, we had a guy from OpenAI on the podcast a while back and he said the exact same thing. He was mentioning that the Dota team spent about six months building a hand-tuned benchmark. He was saying there was a baseline rule-based system; they really spent six months actually building the benchmarks. Yeah. I mean, I don't know the exact amount of time, but at our startup I would almost say we spent 20 to 30% of our time on that kind of thing. I think it totally makes sense. The second and last question, and this is a good one for you, actually. Of all the ML projects you've seen, or consulted on, or been a part of, what do you think is the hardest part of getting them from a model to a production-deployed model? Where's the biggest bottleneck? So, from a working model? No, I would say from conception, from ""here's the goal,"" to a deployed model that people can actually use. I'd say there are actually three hard parts; it's hard for me to pick just one. One hard part is actually getting the data. You were asking where we got the data for the startup; even within companies that seemingly have a lot of data, getting the dataset you need to train the model is often really costly and hard, and there may be a lot of internal roadblocks. So that's one place where I've seen people stumble. The other hard part about getting things to production is the modeling approach itself. A lot of people, and I see this all the time in blogs and on Twitter, say, ""Oh, the modeling doesn't actually matter that much. It's all these other auxiliary things, and the modeling is a commodity."" I don't believe that modeling is a commodity at all. I think it's actually really hard to get models to work correctly, especially when you move beyond a toy or benchmark dataset to real-world data. Building something robust that works at scale is actually really difficult. So I'd say that's the second part: the hard elbow grease and research work of getting a working model is usually harder than people think. And then the last part, I would say, is actually getting buy-in from executives. This is a long journey to getting something out to production. If you're running your own company and you're the CEO, that's one thing; maybe you can push it through because you think it's really important. But if you're in any large organization, there are a bunch of stakeholders, a bunch of business units, a bunch of engineering teams juggling resources. I think a lot of people struggle just to convince companies that it's actually a priority to push out their machine learning effort. And so that's where I'll go back and plug that product management article that I wrote; I think it's really important. You need somebody who's your advocate. It could be your head of data science or VP of Data, or it could be a product leader who is driving AI.
But if you don't have a seat at the table, at the exec table that believes in this and is really supporting it and pushing it. A lot of these things will die on the vine. Interesting. Cool. Thank you so much. That was a lot of fun. Yeah, man. I like your garage. [Both laugh] Thanks",13442 +Josh Tobin — Productionizing ML Models,https://www.youtube.com/watch?v=G6AgmZ6_R3U,2899,2020-07-15,"The question that I got interested in was, is there any way to learn behaviors in a physics simulator, where you actually have access to hundreds of millions of labeled examples but then somehow make that work when the robot is put out into the real world You're listening to Gradient Dissent, a show where we learn about making models work in the real world, I'm your host, Lukas Biewald. Josh is a researcher and entrepreneur, his work at OpenAI was on Sim2Real creating virtual environments to create virtual data for robotics. He also teaches my favorite class on machine learning called full stack deep learning and if you haven't taken that class you absolutely should. Previously he did his PhD in computer science at Berkeley under Piter Abbeel, I'm super excited to talk to him I think for a lot of people listening to this, just knowing our demographics, I think a lot of people would probably be most interested in learning about machine learning and they might even know you from some of the classes that you teach which I think, in my opinion, are some of the best classes out there. I've learned a lot from watching you teach and I'm curious how did you get the idea of teaching a class? How did that come up? It sort of all started to happen around two years ago. I was working at Open AI at the time and Open AI was going through an interesting transition, I would say. When I first joined, it really felt like a very traditional academic lab. It felt like the lab that I was at at Berkeley, except with more resources. And, you know, at some point they figured out that there was a type of work that they were uniquely suited to do that a typical academic lab is not well suited to do, which is this sort of larger projects that involve instead of just a couple of researchers working together, maybe a team of twelve or fifteen folks, you know, a mix of engineers and researchers with bigger budgets, more ambitious goals and really, really trying to push out these projects that would clearly mark a move forward in the state of the field. And so while this is happening, a big part of that change was we needed to figure out how we are going to professionalize our process of building machine learning models, right? And so on the robotics team, which I was working on at the time, we were figuring out stuff like how do we write good tests for machine learning code? So that you don't lose the ability to train a model that you were able to train a couple months ago, which happened to us multiple times. I actually manage a team that has both folks that are doing speculative research stuff that may not be able to really measure their progress in any given week and also people who are doing very traditional engineering work. That is, where you can easily say this is the goal for this week and have we met that goal? So, we were trying to sort out all these things. 
And around the time I was talking to my PhD advisor at Berkeley, Pieter Abbeel, and a friend of ours, Sergey Karayev, who was running at the time, a machine learning for an education company called GradeScope, and we were swapping notes on how we were approaching these things and how Pieter had seen other companies approach these and how Sergey had approached some of the stuff of GradeScope. We realized that there is this whole emerging engineering discipline I will call it, around, you can go online and learn the math and the algorithms behind machine learning; you can learn what neural network is; you can even really learn how to use TensorFlow and how to code this stuff up in an effective way, but at the time, there was very little on everything else that you need in order to actually make this stuff work in the real world. So, you know, not only the things I described, but also, how do you troubleshoot models? How do you choose projects? How do you manage teams? How do you deploy things into production? How do you monitor them once they're in production? And so we realized that everyone that we knew was reinventing the wheel on all of these practices and the number of people that are actually good at this is very small and they have to be tracked in a small handful of large technology companies in the Bay Area, let's say. So we just thought it will be really good for the field if we wrote down everything that we knew about the stuff and everything that our friends knew about it. And so that was the genesis of the Full Stack Deep Learning class. I guess what's amazing, I hadn't really thought about it this way but I feel like I spent my career and I'm a little older than you, studying, making machine learning models work in the real world. But watching your class, I'm learning a ton and I'm seeing you as the expert. How did you get up to speed on this stuff so fast? Was it just the experience at Open AI, because your classes are amazingly deep? Yeah, it's a good question. I mean, I think I was at Open AI at the most interesting point for this, because we were figuring this stuff out from first principles. And so there are tons of conversations around what tests should we have for machine learning models? And it was a really brilliant group of people there who would like to take a problem and break it apart and look at it from the ground up. And so I think I was able to like look at things from, all the way down to the first principle's level through that and then I think it was really just about trying to talk to a lot of the folks that are working in the field and seeing how they approached some of these things as well. A lot of the content we put together in that class came from about thirty or forty interviews that we did with practitioners and just trying to understand. We had a good sense of, what are the hard things? And what questions do you need to ask if you're putting together an applied machine learning team. So just getting a range of answers on those was also really helpful. You know, you have a unique background, having been like a McKinsey Consultant for a number of years. Do you think that informs you at all like that? Do you think about how that might affect the way you approach this stuff? I think one of the things I learned from McKinsey was, how do you approach abstract problems? Like what should our company do? Or, you know, what does your organization structure look like? 
So, like, these problems where it's like, okay, where do I even start thinking about this problem? I think the question of how to make machine learning work in the real world has this flavor, and figuring out how to break that down into parts and structure your thinking around it is definitely one of the essential things that you have to do as a management consultant. And so I think that definitely informed the way that I looked at this problem. So is there a piece of your curriculum that you feel particularly proud of? I think the thing that I put the most emotional energy into was the troubleshooting guide. That is actually my favorite part too. That was the piece that I would say I was writing for myself a few years ago more than anything else. My perspective when I was writing it was, ""How could I have saved myself months of time if I had started over in this field?"" I have to say that I got a chance to work for you briefly, for maybe a month or two, and I think my big takeaway from you, the thing I always hear in my head, is you saying, ""You should always slow down and change one thing at a time."" I mean, I feel like that actually applies to more than machine learning, obviously. But boy, does it apply to machine learning, like, oh my God! Yeah. It's so essential in machine learning, because fundamentally I think the thing that makes machine learning so hard is that when you're writing software, when you're writing code, we have this pretty mature ecosystem where if you make a mistake, usually the system that you're building or interacting with will tell you. First of all, it will tell you that you made a mistake, and second of all, it might even give you a hint as to where that mistake is. But the insane thing about actually trying to make progress on machine learning projects is that most of the time when you have a bug, the only thing that happens is that the performance of your model doesn't get better as quickly as it should. And so there's no way of knowing that you actually made a mistake a lot of the time, unless you happen to have really strong intuition about what your learning curve should look like, right? And so I feel like that's why it's so essential to move slowly when you're building your machine learning models. Although it's kind of funny, because I wonder if programming web applications is the outlier here, because I think about just trying to make an advertising campaign work well, or trying to get my old motorcycle running again, and it's always better to change one thing. Yeah. Because it's so hard to tell what will happen otherwise. But I guess maybe we have this better telemetry or API, you know, programming Python or... I don't know. I feel like that's right. I mean, I wouldn't consider myself a world-class web developer, but when I've done it, it's also been helpful to still just change one thing at a time. I just feel like it's good advice for all situations. Yeah, it might be. Anytime you're building anything... I feel like as you get better at something, you can increase the increment of what you can change at a time. For example, if I'm training an image classifier or something, I can pretty much just start with a newer architecture, like a ResNet or something like that, because I've done that enough times that I know what I can expect the results to look like if it works, and what the common things that can go wrong are.
So I feel like I can skip a couple of steps, but when I'm writing Kubernetes code, right, something I'm not very good at, I still have to move very, very slowly. Would you be down to walk me through your troubleshooting steps and how you think about them? I know a few people will be interested. Yeah. You know, the core concept is what we've been talking about, which is to start simple and then layer on complexity one step at a time. So the first question you might have is, what does it mean to start simple? I think one of the things I've noticed with people that are getting into the field is that there's a tendency to get all this excitement around neural network architectures and the latest and greatest state-of-the-art model on ImageNet. And so I think people tend to overthink the question of architecture selection, and the selection of all the other pieces around that, like what optimizer you're using and things like that. But in reality, I think when you're starting on a new project, the goal is to just choose a reasonable default and start there, even if it's not state-of-the-art. And then once you've convinced yourself that everything around that is working, your data loading code and your training code and all that stuff, then you can gradually move closer to a state-of-the-art architecture. How do you convince yourself that this stuff is all working? Yeah, it's a hard question. I think there are some tricks you can use. So the first thing that I recommend people do when they're training a new neural net for the first time is, first of all, just get the thing to run. Literally just get it to output something, output anything, which is not always as easy as it should be. Let's say you've done that. Then the next thing that I think you usually want to do is try to overfit a really small amount of data, like a single batch of data. It seems really simple, and a lot of people skip over that step because of that. And, you know, 80% of the time it's not really necessary, but 20% of the time you can catch some pretty nasty bugs early on. So, I often recommend this, citing you, but I'm sure it's not obvious to most people: why do you want to overfit a small amount of data? So with any reasonable model architecture, optimizer, training loop, and dataset, you should be able to get your loss down to zero on a single batch of data, right? You have enough parameters in the neural net that it should be able to just memorize the data. And so, basically, if it can't do that, then you know that you must have a pretty bad bug in one of those things. What kind of bug, for example? Like you flipped the sign on your loss function, and your loss actually goes up rather than going down. Or, another one I see all the time: in a lot of these neural network libraries, the input to the loss function is supposed to be the logits, right? Something un-normalized. But maybe you took the softmax of that first. So it's things like that, where you just wrote the code the wrong way, and this is a quick sense check for figuring out, ""Is the code that you're running reasonable?""
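Here is a minimal PyTorch sketch of the single-batch overfitting check just described (the model, data shapes, and hyperparameters are made up for illustration and are not from the Full Stack Deep Learning guide itself):

```python
# A minimal sketch of the "overfit a single batch" sanity check described
# above, in PyTorch. Toy model and random data, purely illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)

# One fixed batch of fake data: 32 examples, 20 features, 4 classes.
x = torch.randn(32, 20)
y = torch.randint(0, 4, (32,))

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 4))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Note: nn.CrossEntropyLoss expects raw, un-normalized logits. Applying a
# softmax before the loss is exactly the kind of bug this check surfaces
# (the loss plateaus well above zero instead of collapsing).
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss = loss_fn(model(x), y)  # a flipped loss sign would make this climb
    loss.backward()
    opt.step()

print(f"final single-batch loss: {loss.item():.4f}")
# With enough parameters this should approach zero; if it does not, suspect
# the data loading, the loss, or the training loop rather than the dataset.
```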
Okay, sorry I cut you off. Then what do you do once you can overfit one tiny subset of your data? Yeah. So when you can overfit a tiny subset of your data, then I would say one way to think about the process of making your neural net better and better over time is that there's an outer loop and an inner loop, right? In the outer loop, you're generally trying to do one of two things: you're either trying to reduce the amount of underfitting that your neural net has, or reduce the amount of overfitting. There are a lot of strategies for doing both of those things, but the best strategy for reducing underfitting is to make your model bigger, and for reducing overfitting it's to add more data. And so if you think about what we just did with overfitting a single batch of data, with driving loss down to zero on a single batch, we're basically saying, let's take the smallest possible dataset and overfit it, right? So now the next question in your decision tree should be, ""All right, now we know that we're overfitting, because we can drive the loss down to zero. So the next thing we should do is reduce overfitting."" And the simplest way to do that is to add data, but you want to do this gradually. So typically, what I would do next is move from a single batch of data to a smaller or more simplified version of the dataset I was working with. Maybe it's a million images and you only take a thousand or ten thousand of them to start out with. Maybe you make a synthetic toy version of the problem you're working with. If you're doing reinforcement learning, maybe you work with one of the standard simple benchmark problems like CartPole or something like that. So you just make the problem one step more difficult than a single batch of data. I see. So you add one piece of complexity? Yeah, that's the way I think about it. Why wouldn't you just add all the data that you have? Because, like, your conclusions, I imagine, could change at different scales of data, for example. Yeah, definitely. I think there are two core reasons. One, and I guess this is the simplest one to explain, is that it just reduces your iteration time. If you're working with a smaller or simpler dataset, then typically your model will train faster, it'll be cheaper, and so you can just try things out more quickly, which is super key. But I think the deeper and more interesting reason is that a lot of times in machine learning, you have some degree of confidence that this model should actually be able to solve the task that you're working on, but a lot of times you don't actually know that for a fact, right? Maybe you're doing image classification, but you're not doing it on ImageNet, you're doing it on some other dataset. Maybe you're classifying whether a person is wearing a hat in the image or not. And so intuitively you feel like it should be possible to solve this with a neural net, but you don't actually know that for sure. So you want to try to isolate the sources of error in your problem. And if one of the possible sources of error is that this dataset is just too hard, then it makes sense to start with a version of the dataset that your model should be able to do well on. Smaller, less complex datasets allow you to do that. But wouldn't a smaller dataset make the problem harder? In what sense? Say I'm trying to classify if someone has a hat on or not; if I have less training data, I would expect my accuracy to be lower, right?
Hmm. Yeah. That's certainly true. So I guess this comes back to the overall process that we're trying to follow, right? I think of it as iterating between eliminating underfitting and eliminating overfitting. And so if you're in a situation where your model is doing perfectly well on your training set, then it makes sense to increase the complexity of your training set. If you're in a situation where your model can't do well on your training set, then you need to figure out, is it that my training data is too hard? Is it that I need a bigger model? Is it that I need a different architecture? Is it that I need a different optimizer, different hyperparameters? And so, working with a dataset where it's easier to get to that point of your model overfitting reduces the number of things that could be wrong with your model. Interesting. Are there more steps to this? I mean, that's the high-level flow, right? You know, solve your problem, make it harder, solve your problem, make it harder. And then there's details about how to make each of those things work well, right? Like, what are the steps you should actually try when you're underfitting and you need to make your model more expressive? That's the overall picture. We'll have to put a link to this so people can find it. Do you plan to teach more of these classes? I think so, yeah. We don't have concrete plans to do another one. I mean, it's not a great time for in-person classes. Maybe a virtual one... Yeah. Maybe a virtual one. That could be fun. Do you have any advice, I suppose, for folks wanting to get into machine learning? I'm sure you probably watch a lot of students learn it or not learn it. Do you have a sense of what's required? Some people look at something like machine learning and they say, ""OK. This is a really deep field and there is a lot to learn here. There's a lot of complexity. So many papers, thousands of papers coming out every month, and so I want to just drink from the firehose and try to learn as much as possible."" And then on the other extreme, there's folks that say, ""Look, this field is so complex that I want to just pick a problem and solve that problem."" And I think there's failure modes on both ends of that. I think I work with people who see the complexity of the field and react to that by just learning more and more and more, but never actually really getting their hands dirty and figuring out how to make this stuff work for the problems that they care about. I think that typically doesn't work. You know, I've seen probably just as many people who don't want to deal with the complexity, like don't wanna learn the math, don't want to understand how a convnet works. I think that also limits your ability to make progress in the field, because ultimately, it's closer to a science than an engineering discipline right now, I would say. And you need to balance spending time on actually doing stuff and following tutorials and making things work, and then also going back and backfilling, like, ""OK, I've trained a convnet on this image classification task, I know how to write the TensorFlow code. Now, let me actually go back and understand how a convnet works."" The folks that you've seen that have been successful, that have learned this stuff and have gone on to good careers as successful people. Do you think they spend more time on the theory on average or more time on the practical hands-on stuff, or is there some other third thing that they're doing more of that makes them successful? 
I would say more time on the practical hands-on stuff. One of the interesting things about machine learning is that although there's a ton of complexity, there is a relatively small number of core ideas that you actually need to really deeply understand in order to be an expert in the field. Understanding attention in neural nets is really important. Understanding how backpropagation works is really important, but understanding all the different state-of-the-art architectures for doing object detection is not really very important unless you happen to be working full time on that problem. So I would say that the people that I know that have successfully learned the field have spent more time with a smaller number of ideas, and rather than trying to read five new papers every day, they've gone out and talked to people and figured out what the five most important papers are. And then have spent weeks with each of those to really deeply understand them. But then have also spent the balance of their time actually trying things and implementing things. That makes sense. When you look at the papers that you've written, do you have a favorite? I think my favorite is actually the first one I was the lead author on, which was the domain randomization paper. Cool, sim to real? Sim to real, yeah. Yeah. Can you tell me the real process of thinking of that idea and then trying it and how that all happened? Well, first describe the idea, because it seems like one rare paper that you can really succinctly describe. When I was starting to work in this field, the intersection of deep learning and robotics, back in 2015, there was a lot of excitement around reinforcement learning being applied to robotics. So with reinforcement learning, you have an agent that interacts with an environment. It takes some observations of the environment, decides what action to take, and then gets a signal back from the environment, which is a reward that tells it ""did I do a good job or a bad job?"" And then over time, it iteratively learns how to interact with that environment and improve its performance on whatever task it is supposed to be doing. So it's like a very natural abstraction for robotics. And, you know, back in 2015, deep reinforcement learning was starting to have a bit of a renaissance. It had started to work really well on Atari games. I think 2015 or 2016 was when DeepMind beat the best human players at Go. And so people were looking at this and saying, ""Wow, this could actually be the most important technology to come to robotics in a really long time."" And so I was early on in my PhD at that point, and the exciting thing to work on was coming up with what's the best new reinforcement learning algorithm; like how can we improve our performance on all these tasks? But I was very new to the field and I felt like it would not be very smart for me to try to compete with people who had been studying this stuff for years and had a lot of insights into what made those algorithms work. And so what I tried to do was think about, ""OK, what are the enabling pieces that we need in order for that story to actually come true?"" The story that deep reinforcement learning is going to have a big impact on robotics. And for me, the piece that was kind of missing from that story was that deep reinforcement learning is very powerful, but it's very data inefficient. Like all these state-of-the-art results that you see happen in environments where you can simulate everything that's happening. 
Because it takes hundreds of millions or more interactions with the environment to actually get to the point where you have good model behavior. And so for me, looking into this field from the outside, that was sort of the big question mark. Is there any way for us to get around the data inefficiency problem for robots? Because going out and collecting a hundred million examples of a robot interacting with an environment is not very cost effective, let's say. Google did this right? The arm farm? So it's definitely possible, but do you really want to have to have dozens of robot arms running 24/7 for weeks every time you want to learn a new behavior? Sure. Yeah. So coming back to this paper, the question that I got interested in was, is there any way to learn behaviors in a physics simulator, where you actually have access to hundreds of millions of labeled examples, but then somehow make that work when the robot is put out into the real world? I was kind of working on this back when I was an intern at OpenAI, and we had a really concrete problem that we were trying to solve. We were trying to set up a robot to make a stack of blocks. So it would, like, pick up blocks from a table and then stack them on top of each other. And the robot behavior was trained assuming that you actually know where the blocks are in the real world. And so then we needed to go back and backfill. Like, how do we actually find out? How do we estimate the position of each of these blocks in the real world? It's something that seems like a really easy problem, but actually when you think about ""how do you make this really work?"", it's more complicated than you'd expect. Honestly, it's so counterintuitive. I think even for me, and probably for most people, it's amazing that that's hard. Yeah. And I think it's not like the hardest research problem in the world, but when you actually sit down and try to go and make it work really well, it's very tricky. Sure. And so we were playing around with these different tags, you know, ArUco tags and methods like that, where you understand the intrinsics of a camera, and then it reads this tag off of an object and then it can infer the position of the object, given the position of the camera. And we just found those things to be really fragile and, honestly, not really accurate without investing in expensive setups and expensive camera equipment and stuff like that. We were mostly deep learning folks, right? And so the obvious question is, why don't you just train a neural net to do this? You know, train a neural net to take an image of a table and then say, ""OK, here's the position of all of the cubes on the table."" But the problem is, where do you get the labels for the dataset that you collect? You almost need to solve the problem. You almost need to know where the cubes are in order to actually get the labeled dataset that you'd use to train the neural net, right? So it's a bit of a chicken and egg problem. And so this was kind of the starting point for me working on this sim-to-real problem: ""This feels like the simplest possible example of a problem where maybe synthetic data, data from a physics simulator, would actually help."" So then describe what you did. So the core idea is that if you just take data from a simulator naively and train a model on it, the problem is that there are quirks of your simulator, right? 
Your simulator doesn't perfectly match the real world, and so the neural net overfits to any difference between the data in the simulator and the data in the real world. So if you didn't perfectly model the lighting, or you didn't perfectly model the color of the cube, the neural net won't transfer. So the idea that we had was, what if instead of just taking your single best physics simulator, you massively randomize every aspect of the simulator that's not critically important to solving the task? So you randomize the colors of all the objects, you randomize their positions, you randomize the position of the camera, you randomize the background, and it produces images that are crazy and unrealistic looking, right? So they look like scenes from an animated disco or something. But what happens is that the neural net, in learning how to estimate the position of the cube in all of these massively different worlds, is forced to not rely on the parts of the simulator that are not essential for solving the task. So if the color of the cube changes in every single data point, then the neural net can't create a feature that depends on the color of the cube to solve the task, because that's just an unreliable piece of information. And so when we do this, it turns out that you can train neural nets on entirely simulated data, so no real world data at all, and they actually work when they're deployed in the real world. Because you just cycle through lots of colors and shadows and other... Yeah, exactly. Exactly. You basically show the neural net every color and every shadow that it could possibly see. And so then in order to solve the task, it needs to learn that colors and shadows are not important; what's important is the position of this cube-looking thing on the table. And so it's not overfitting to all the details that are unimportant, and the details that it is looking at are the ones that hopefully will transfer over when it's deployed in the real world. So how far can this generalize? Has this been applied to more than stacking blocks now? Yeah, it's been applied to a pretty wide range of computer vision and robotics tasks at this point. I think it's been applied to... My favorite random application was a paper about using domain randomization to train a robot to pick fish out of a barrel. Really? Yeah. Wow. Which is actually a really hard task, because fish are very shiny and slippery. And in general, most object detection methods, computer vision stuff, have trouble with objects that have a lot of reflections and things like that. So that's my favorite random application. But you know, it's been applied to folding cloth. It's been applied to a pretty wide range of computer vision problems. And I think the furthest this idea has been pushed was at OpenAI, when they used this technique to have a robot hand that solves a Rubik's cube. Can you say a little bit about why that was such an impressive task? Well, I guess there was maybe some controversy about... Is this sort of a stunt or is this a real, deep task? Where do you land on that? So maybe the different sides of this issue would be: on one hand, if you look at the types of tasks that people have been able to solve with robots over the years, tasks of this sort, using a high-dimensional dexterous robot to manipulate complicated objects, are very few and far between in the robotics world. 
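Here is a minimal sketch of the domain randomization idea described above, in plain NumPy with entirely made-up scene parameters: everything inessential to the task (background, cube color, cube size) is resampled for every example, while the label, the cube's position, is the one signal that stays meaningful. It illustrates the concept only; it is not the paper's actual pipeline, which uses a physics simulator and also randomizes lighting, camera pose, and more.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_randomized_scene(img_size=64):
    """Render a toy 'cube on a table' image with everything inessential randomized."""
    image = np.ones((img_size, img_size, 3)) * rng.uniform(0, 1, size=3)  # random background color
    cube_color = rng.uniform(0, 1, size=3)                                # random cube color
    cube_size = int(rng.integers(6, 14))                                  # random cube size
    x, y = rng.integers(0, img_size - cube_size, size=2)                  # random cube position
    image[y:y + cube_size, x:x + cube_size] = cube_color
    # The label is the one thing never randomized away: the cube's center.
    label = np.array([x + cube_size / 2, y + cube_size / 2]) / img_size
    return image, label

# An entirely synthetic training set; a position-estimation model trained on it
# cannot rely on color or background, because those change with every example.
data = [render_randomized_scene() for _ in range(10_000)]
images, labels = map(np.stack, zip(*data))
print(images.shape, labels.shape)  # (10000, 64, 64, 3) (10000, 2)
```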
And it's generally agreed that high-dimensional, contact-rich, dexterous manipulation has been one of the grand challenges of robotics. And so I think one point of view on this is that even just something at the proof-of-concept level, to show that it's possible to even do this once, is a big step for the field, because there's very few examples of projects that have pushed robotic manipulation as far as being able to solve a Rubik's cube. I think the other perspective would just be that, if you look at the details in the paper, the algorithm actually works about 20 percent of the time. And so you might argue that... And it was a pretty big effort to actually make it work for the first time. So, a pretty big team working on it for a long time. And so you might argue that obviously, if you put 10 or 12 really brilliant people and have them work on one tiny sliver of a problem for a long time, then obviously they'll be able to make it work once. I would say that... That's not obvious to me. I don't know. [laughs] I'm trying to play devil's advocate here. My bias is that it's an important result in robotics. And I think that the perspective that you have to have when you look at this is that it is very much a research result. I think a mistake that people make when looking at results like this, and I think this is true in AI in general, is that you look at computers being better than humans at some task and you say, ""OK, this means that robots are going to take this job in two years."" If you look at the details of how hard it was to actually make this work, once, in just about 20% of cases, there's a lot more research that needs to happen in order for this to become a thing that robots can do reliably. But I do think there's a lot of value in the proof of concept, just to show that this is a set of techniques that this team was able to push far enough to do a task that is objectively really difficult for robots to do. And then over time we backfill and go, ""OK, we could actually do that in a more efficient way."" I didn't realize it only worked 20% of the time. This is a 20% success rate, meaning it completely manipulated the cube back into the correct state, is that right? Yeah, I think the fine print is, for the hardest variant of the problem, which is the cube randomized as much as it can be, the robot was only able to get it back to fully solved 20 percent of the time. I think on average, it did better than that. And I think also... Yeah, maybe one of the other details people took issue with was the fact that the machine learning algorithm itself didn't decide the sequence of actions that you need to solve the cube. It wasn't like a neural net that would say, ""turn this face and then turn this face and then turn this face."" There was a hardcoded solver that determined the sequence of actions to take, and then the neural net was just saying, ""OK, here's how you move your fingers in order to achieve this action."" So the point was the manipulation. Exactly. Yeah. So people are mad because it was a fun demonstration [laughs] I think people often take issue with the way that OpenAI communicates results like this more so than the results themselves. Because it seems like they're generating attention, is that right? I think so, yeah. 
There's a bit of tension in the field right now between people who maybe have more traditional academic roots and who think that it's the quality of the scholarship that's important and whether it's truly novel, you know, whether the results are really understandable and reproducible, you know, on one hand. And then on the other hand, folks who typically are at more of the industrial research lab type places, where I think the viewpoint is more, ""Our goal is to push the state of the art of the field forward. And if we have to do that in a way that's not totally a hundred percent reproducible, just because the experiment is too expensive, that's OK, because we're moving the goalposts forward on the types of things that A.I. is able to do."" And I think there's a fundamental tension there. That makes sense. So I guess you've left OpenAI. What are you working on now, Josh? Yeah, it's a good question. One of the things that I learned through Full Stack Deep Learning... maybe one of the beliefs that I have about this field is that there's this narrative in the machine learning world that A.I. is going to be part of everything, and it's just going to be like software where it's just sort of happening in the background as part of every little thing that we do, and it's going to enable all these amazing new applications like self-driving cars. But in general, it is just going to be there in the background, making the world about 10 or 15 percent more efficient or more. I don't know. But we're not there yet. And so, one of the core questions for me over the last six months or so since I left OpenAI has been, ""Why is that? What's blocking us from having just a little bit of machine learning that's making every piece of software that we interact with smarter?"" And that's the fundamental question that I am trying to answer with this company. It's so interesting. We actually always end this podcast with two questions, and that question has been one of them. I mean, you've clearly spent a lot of time thinking about it. What are some of your conclusions? If you had to pick one thing, what would that be? I mean, this comes back to our conversation about the robot hand, right? I think the field has gotten really good at doing really impressive things once, but then one of the dirty secrets of machine learning is that taking something that is 90% accurate on one dataset and turning that into a reliable production system that is auditable and maintainable and understandable and that you can actually start to run your business on, that's really hard, right? I think figuring out how to do that is the big question that the field needs to answer right now. Yeah, I guess it's kind of counterintuitive to see a computer do something 20% of the time, because really, most times that I see a computer do something, I know, OK, it's going to do that 100% of the time. Yeah. No, definitely. Yeah. For sure. I mean, I think maybe one of the other things I have seen through Full Stack Deep Learning, and through some of the other folks that I've talked to who are trying to implement machine learning in companies, is that oftentimes, one of the hardest things to do is figure out how to get the executives in your company, let's say, the folks that are making the decisions but are not close to the technology, to actually understand what we can really do with this stuff. 
I think that's one of the things that's really hard about machine learning: there's not always a clear connection between what you read AI can do and what it can actually do, and communicating that, I think, is another big challenge for the field. Do you have any suggestions there? I mean, almost everybody we've talked to has brought that up. So one answer would be to make a really good class that's like AI for everyone. Andrew Ng has a class, I haven't gone through it, maybe that's the answer. But I think you can build intuition for this stuff, and I think that doesn't come from reading the New York Times headlines. It comes from actually sitting down and looking at examples of things that work and things that don't work. So I don't have a good suggestion, but I do think there's a big opportunity to do that. It's funny. I've heard that IBM Watson in its heyday would fly executives to a lab and blow their minds and get them hyped out of their minds about the potential with really awesome demos, and I've always had this fantasy of doing the opposite. I'd do like an hour with executives and make it really hard and boring and let them fight with the AI for a while, even just try to tune some hyperparameters to actually get the thing working. I think it would be a fun, maybe an informative, experience for a lot of people. Maybe it'll help the executives understand why their ML teams are not producing results as fast as they're hoping. Oh, yeah, totally. I think also, one thing that would help a lot is methodology. This is what we tried to do at Full Stack Deep Learning, but maybe didn't really get all the way there. I think that the methodology of successfully building machine learning systems is still pretty immature. It shares a lot with software engineering, but it's really a different field. I think that if there were an agile equivalent for building machine learning systems, that would also go a long way, because it's really just the blocking and tackling. Like, how do you actually make this happen? So it doesn't feel as much like magic, like, what are those crazy data scientists doing in their corner over there? And it feels a little bit more like, ""OK, I understand that this is the set of meetings that the team is having every week and this is how they're measuring their progress."" I think something more operational like that could also go a long way. Seems like you could be the right guy to figure that out. I don't know. Maybe. Here's the other question we always end with, and I'm really curious to know what you'll say. Just off the top of your head. What's an under-appreciated topic in machine learning that you think people should talk about more? I mean, given all the hype about so many of the topics, what's a piece that people don't pay enough attention to? I think that people don't pay enough attention to the quality of their training data. [laughs] I agree, I agree, Josh. But it's so important. So important. It is. Nice. All right. Well, that was really fun. Thanks for that chat. That was fun. Thank you for having me on.",7501 +Miles Brundage — Societal Impacts of Artificial Intelligence,https://www.youtube.com/watch?v=O2ya8M72y0U,3745,2020-07-16,"It's not that A.I. is uniquely dangerous in this respect or uniquely likely to lead to harmful, unbridled competition. But it's more that we have already gone through the lessons of other technologies. 
Like with the case of cars, where it took decades of gradually ratcheting up standards around fuel efficiency, seatbelts, airbags and so forth. And we're in the early stages of that process for AI. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. Miles researches the societal impacts of artificial intelligence and how to make sure they go well. He joined OpenAI as a research scientist on the policy team. Previously he was a research fellow at the University of Oxford's Future of Humanity Institute and served as a member of Axon's AI and Policing Technology Ethics Board. I'm super excited to talk to him. So let's look at your background. You had sort of an interesting path into AI policy, right? Originally working on energy and climate issues. Is that right? Yeah, exactly. So in undergrad in D.C., and then for a little while after I graduated, I was working at the Department of Energy and other think tanks in the broad area of energy and climate policy. I ended up wanting to go to grad school because I wanted to do more research and less administrative work. It was like boss support type stuff; special assistant was my role. I learned a ton from being in government and working on something that was super hot at the time; energy and climate policy was a big, big deal. People were talking about it way more than AI or anything like that back in the day, and it was super enlightening. But I also eventually concluded that while energy was important, there wasn't as much work to be done in terms of research to move things forward, and AI was this fresher, greener field in terms of research, and it wasn't just a political problem of getting the right thing to happen, which I think is arguably the case with energy and climate change now, where we have some good understanding of the policy issues - not everything - but we have a decent understanding of what is to be done and it's largely a matter of political will. Whereas in AI it's not even clear what we want and what should be done, let alone how to get the political will. So that felt like a more exciting opportunity for impact on the research side. OK. So having thought deeply about both, which do you think is more of a threat to the future of humanity? Climate change? Which worries you more? Definitely in the near term, I don't think AI is going to end the human race or anything like that. It's still relatively early stage, but we do need to think about the long term risks of technologies as they're developed, whereas energy is something that's a clear and present danger right now; and I'm hopeful that AI will actually help with addressing it. But, yeah, I think energy is a known, long-time-scale issue where we know that there's a bunch of things that need to change to get to a good outcome. In the case of AI, we know that we're not in imminent danger of all jobs being automated or killer AI taking over the world. But we see these trends, and they're more uncertain, especially in how quickly they develop. So I think despite all the uncertainty we have around climate change and energy technology innovation, where we don't really know what's going to happen in a few decades, we actually have more reasonable error bars for that than we do for AI, where experts are all over the place. Some think things will be solved in the near future, others that it will take centuries, and I think it's actually harder to be very confident about AI than about energy and climate. 
When you say long-term, near-term, medium-term, do you feel like there is a serious possible issue with AI in the next, say, 50 years? It's really hard to say. I think we should try not to be too confident in where the trends are going to go. For example, just in the past few years, there's been a lot of technological progress that people have found surprising, you know. Very few people were expecting as much progress in things like machine translation and natural language generation as has ended up happening, and similarly, things like image recognition have just gone from, ""OK, it's in the human performance ballpark"" to, ""OK, now it's superhuman"" on some metrics. It's not superhuman across the board, but on well-defined tasks, we've made a ton of progress. Will that lead at some point to systems that are harder to control, or that might cause unintended side effects in increasingly large contexts where they're deployed? I think it's hard to say. I think we should think of this as something where the stakes are going to rise, both as the capabilities increase and also as the use cases expand. So even if A.I. doesn't progress much in terms of capabilities, it seems like there's economic pressure and other sorts of reasons why it's going to be deployed in more contexts, and as those contexts get more dangerous, the stakes are going to rise. We're already seeing AI applied for things like predictive policing and face recognition; so things that are highly sensitive. Also things like search engines and recommendation systems that are actually materially affecting what information people get and what products and services they get; that's definitely the sort of area where the stakes are high and getting higher. But, you know, it's hard to say what the endpoint of that evolution will be. And I think the key is to think ahead and be at least a few years ahead in terms of what sorts of problems we are anticipating. So that we're not caught off guard. So things like misuse of AI systems; I think we should not wait until the risks have materialized, but instead think about what can we do in the design phase to make these problems less severe? Can we develop detection systems? Those sorts of things. I don't think there's any inevitable risk. I think it's more like, how prepared are we relative to the capabilities at the point they arise? I feel like with a lot of these things that I hear about AI, it seems like you could almost substitute the word technology for A.I. and you'd kind of come to the same conclusions. Do you feel like there's something special about A.I. that requires more vigilance or different kinds of ethical concerns? Yeah, I think it starts to get a bit risky when there's more uncertainty and more variation in the kinds of actions that systems take, as opposed to just being deterministic code that does the same thing over and over, and when they're dealing with uncertain inputs and, you know, essentially shifts in the distribution of inputs. I think the risks grow when you're moving from software to A.I., but definitely there are all sorts of systems in society that are very complex already, even before AI. 
So think of the number of lines of code in an aircraft; these are already super complex systems, so it's not like AI is suddenly going to lead to this new complexity. But rather, there are elements of AI, such as the processing of a wide range of inputs and potentially making decisions that previous software systems were not entrusted to make because we lacked the ability to produce those kinds of outputs. So things like robots, for example; there are applications of AI that were just not possible if you didn't have that technology. So I wouldn't draw a clean line between them. I see AI as an extension of information technology and software and, going back even further, things like electricity. And it's unclear whether that cognitive element, this sense in which AI is about thinking as opposed to just following rules blindly, is actually a significant element that introduces a lot of risk. I don't know whether we should be AI exceptionalists, where we're treating AI like an exception, or just treating it as part of the spectrum. I think there are pros and cons of both, because you don't want to ignore the growing capabilities in cases where AI can do things previous technologies didn't do. I think we're seeing lots of cases of products that would never have existed prior to the current wave of AI. But we also don't want to be hysterical and overemphasize the novelty when there are lots of technologies that have big impacts on people. So do you have an agenda of some types of change that you want to enact in the world at this point, or are you still trying to figure out what you're trying to get folks to do? I'm still trying to figure out the overall agenda, but I think one of the key components is cooperation between AI developers, which I think is super essential to figure out what the right practices should be and to hold one another accountable. So, for example, I did some work with Amanda Askell and Gillian Hadfield in a paper last year that talks about the need for cooperation in order to avoid a race to the bottom between AI developers. So insofar as I have an agenda on that issue, it's avoiding a race to the bottom and avoiding that scenario. What would a race to the bottom mean in this context? So driverless cars are perhaps the most clear-cut case, where there's an incentive to be safe, but there's also an incentive to get products to market. And in cases where systems are deployed in a way that's premature in some respect, like not taking into account that there can be a jaywalker, for example - I think that's a case where there is a design element of the system that was not super thought through, and if you have people who are overlooking these things, cutting corners in a rush to get products to market, I think that could cause harm for the whole ecosystem, not just causing individual crashes. But there are cases where it's people following their own individual self-interest to get things out there faster, and this can lead them to cause harm and ultimately make the whole sector worse off. So that's why we need regulations as well as informal norms in the AI community that put things like safety and security and privacy and so forth first, so that there are some guardrails. So there's competition, but within certain guardrails. Unbridled competition, I think, could lead to lowering of the standards in an effort to one-up one another. You mean unbridled competition in terms of researchers? Is that what you mean in this context or ...? So I specifically mean in terms of products going to market. 
So if there were no standards for, say, driverless car safety (fortunately, I don't think there are any countries that have no standards; their policymakers are aware of this need to impose some guardrails)... But, you know, if there's a world in which the standards are insufficient and there's insufficient vigilance about how safe the systems are, then there's not going to be sufficient pressure to make sure that the sector as a whole moves in the right direction long term. So there's a gap between individual interest and collective interest, as well as between the short term and the long term, and that's where I think it could go awry. And that's why things like ethical principles that keep you thinking about the long term and clarify what's expected of different actors, and also regulations that impose some floor on what the level of safety or security should be for a given application, are necessary in order to prevent it from just being the wild west. So when I say unbridled competition, I mean on the actual deployment side; there needs to be some process for making sure of how safe systems are. So compare autonomous vehicles to the car industry. The car industry has all kinds of safety regulations that make sense. Do you feel that AI needs some sort of different kind of cooperation than other industries maybe need? I'm not sure it needs anything different. I think it just needs to catch up to where other sectors are. So you know there's a whole history of nuclear power plants responsibly disclosing incidents in order for others to learn from them. And similarly, there's a whole history of airplane crash investigations and regulations and so forth. So it's not that driverless cars or A.I. generally is this new scary thing. It's more that we need to apply the same approach as we have done in other technologies. So going back to your question earlier about A.I. versus other technologies; I think this is a case where I see them as very similar. It's not that A.I. is uniquely dangerous in this respect or uniquely likely to lead to harmful, unbridled competition. But it's more that we have already gone through the lessons of other technologies. Like with the case of cars, where it took decades of gradually ratcheting up standards around fuel efficiency, seatbelts, airbags and so forth. And we're in the early stages of that process for AI. Interesting. Is there an industry that you think is doing particularly well on this or particularly badly? Good question. I think driverless cars are not, to be clear, an unambiguously bad thing. I think there's a ton of great work on figuring out how to make driverless cars safe, and there is some sort of informal cooperation going on among those developers. Though it's kind of tricky due to things like antitrust laws that prevent too much coordination. It's not clear exactly how much is happening and how much can happen. For other technologies, I think there's actually been a ton of really good work on things like image recognition and trying to characterize robustness in that context. So, you know, with regard to things like interpretability, we're much further along in terms of interpreting image classifiers than we are with language systems, for example. Classifiers generally, I'd say, are at the more mature end of the spectrum, in that we have clear metrics for performance and there are a lot of open source models to compare things against. And there's a lot of documentation of things like biases and central issues with robustness. 
That's not to say that all classifiers are good or something like that. But I'd say that if you look across AI, there are areas that are more and less mature in terms of the rigor that people are applying. I guess it seems to me like more of the AI industry comes out of a research background, or maybe there is a little bit more of a culture of cooperation. It does seem notable that so many models that people use in companies come out of open source projects or research papers. So it does seem like there's a fair amount of cooperation, at least on the technology side, like maybe more than you'd actually expect. There is. Yeah. And there's an interesting question of how long that will last and what are the underlying drivers of that openness. One argument for why it might be more open than you might otherwise expect is just that individual AI researchers want to publish stuff, and therefore that puts pressure on the companies that hire those people to allow them to publish stuff. And I think that's a very strong incentive. And the fact that AI researchers are sort of a scarce resource in the talent market, that gives them some influence over what the policies are at these companies. And there's also another argument for why it might be so open, which is that it can benefit the companies to be open by getting people to use their frameworks, and it can make it easier to hire people in the future. But I wouldn't want us to rest on our laurels here. And that's why, for example, in a report that my colleagues and I put out recently, we talked about the need for more generously supporting academics with computing resources, because we don't necessarily want to just assume that those in industry will continue to release products and that forever there will be that pressure from below to get these releases, because there could also be cases where suddenly there's a huge commercial potential that's identified and a company sort of closes up. Or at an international level, there's pressure due to competition with China to clamp down on publishing. There are all sorts of things that could happen longer term. So I would say I'm glad that there is so much openness and collaboration today, but given how much has changed in the last five years, I think we need to start thinking about what are the policies that can keep it that way. You were actually talking about, I think, a paper you just recently published that talks about AI development. I had a couple of questions about that. You were actually the first author of that paper, so I can imagine there was a fair amount of heavy lifting? Yeah, I was one of the first five authors. So it was super collaborative. Got it. Cool. One thing I thought was that it was a provocative title - Mechanisms for Supporting Verifiable Claims. I thought it was interesting, and I think I agree with the importance of it, but maybe you could share the thinking behind that. Yeah. So the basic idea in that report is that AI developers are making all sorts of systems, and they want to sell them or they want people to use them. And they make various claims about these systems, like ""we did bias analysis"" or ""they're robust in X, Y and Z ways."" And there are some claims like that that are relatively easy to verify. Like, if you're in the case where it's an open source system, you might be able, if you have some technical expertise, to replicate the results, reproduce a certain error rate and verify that these AI developers are telling the truth. 
But there are other cases where it's not as obvious how to do that. For example, if it's a very computing-intensive system, it might be harder for academics to scrutinize it. Or if there are claims being made about privacy preservation, but it's this new fancy approach to privacy-preserving machine learning where there isn't just one standard way of evaluating it, or it's a new library that hasn't been subjected to a lot of scrutiny for bugs. Those are cases where it's harder to verify the claims made by AI developers. So what we did in this report is try and take a very broad, 30,000-foot view of, ""OK, what is the problem here?"" And we broke it down into institutions, software and hardware as the three elements of an AI project that can contribute to allowing one's claims to be verified, and in each of those cases we zoomed into what are the specific interventions that can be made. Like in software, for example, we talk about privacy-preserving machine learning and the need for more standardization and open source libraries there, so that it reduces the skill requirements and increases the clarity of how different systems should be compared. And interpretability, audit trails for safety-critical systems, etcetera, are some of the things on the software side, as well as things at the level of the hardware, that can incrementally be solved, since it's not going to be solved overnight. We don't want to 'overclaim' what we're accomplishing here. But what we tried to do is survey what are the ways that people can show that they're actually being responsible, as opposed to just saying, ""Hey, we're being responsible, we care about safety."" How do you actually provide evidence of that? And are there ways in which there are barriers to providing that evidence that we think are important? I thought it was interesting you recommended that the government give compute credits to academic institutions, right? It struck me. I actually remember leaving academia to go into industry, and one of the real impetuses for me, I guess, was the fact that tech companies just had so much more data, which led to more interesting problems. And I do feel like when I talk to my friends at Facebook or Google, they feel more sophisticated in some specific ways, having dealt with such enormous datasets, and in a way that I don't think typically gets published. And I feel like OpenAI is one of the few places that clearly does incredibly compute-intensive stuff. But I wonder if you actually deal with the same scale of datasets, and if that might... I feel like there might be the case that a couple of big companies are getting a skillset that doesn't exist anywhere else and doesn't really get published. I'm not sure. Yeah, I mean, there's definitely a sense in which some companies have this infrastructure in place for generating huge amounts of labelled and unlabelled data, and that puts them in a strong position to do work in that area. I think it's also possible to do cutting edge work with open source data, through existing datasets, or by scraping and building your own datasets. So I wouldn't draw this hard distinction but, yeah, I think there are lots of ways in which industry provides these opportunities that aren't necessarily available elsewhere. And that's part of what's driven academics into industry. 
And part of where we're coming from on this report is like, ""Is that a good thing?"" ""Are there ways that you can balance that?"" It's slightly easier to say, ""OK, let's balance out the compute side of the equation"" than the data side of the equation, in part because a lot of the data is private by nature, and it's really hard to get that out of these companies in an ethical way. But I think we should also be thinking about data as a differentiator, you know, in different sectors. I would also like to see governments, in addition to providing compute credits or other means of supporting academics, also building a data center or something like that. Also, generating labelled datasets is another thing that government can do, because it's not clear that whatever is easiest to collect at Google or Facebook or Twitter or whatever is inherently the data that we need to solve all the problems in the world. In fact, those datasets by default have lots of biases. So I think one potentially exciting area is government support for datasets that could be used by large numbers of people and that are specifically designed to be less biased and to be able to help a wide range of actors. And I think the fact that you can cheaply copy data is a strong argument for this being something that governments should do. It's like building a highway or something like that that benefits large numbers of people. And, yeah, you can exclude people from the highway and do tolls and stuff, but generally, it's public infrastructure. And similarly, producing datasets that can be widely used is another thing. That's kind of a cool idea. I mean, it does seem to me that datasets have pushed a lot of innovation in ML. I also remember when I was a grad student, there was this frustration. It seemed like the tasks that we worked on were based on the datasets that happened to be available. Although I feel like one of the issues that we had, maybe, was that collectives of people would come together to create datasets, but it would create a huge amount of bureaucracy in that data collection process because no one person really owned it. And then I think it would end up being a much bigger undertaking than it necessarily needed to be. So I could definitely see governments having trouble making decisions like, ""OK, what are we actually going to collect here?"" Yeah, and it's not clear what's to be done. You can also imagine, in addition to compute credits, giving academics data labelling credits that just allow them to go to some third party service and pay for some amount of labelling. I think there's probably a role for that, in the same way that there's also a role for big public datasets that a bunch of people use for some general class of problems. So I think you ideally want to reduce the barriers to entry on both generating these big useful datasets, but also making sure that people have access to more tailored data for their own needs. I'm curious, and it's a little bit jumping around, but it seems OpenAI has actually been a real driver towards using more and more compute on these different projects, and that kind of makes things hard to reproduce and potentially comes with some environmental impact. Do you have any thoughts on that? It's a really good question, and I've liked a lot of the publications on this topic, like the Green AI paper from some folks at the Allen Institute, and lots of other people have been calling attention to it. 
I think in general, my view is that, all things being equal, we'd like to not use more compute than is needed for solving a given problem. But there are some ways in which it's not as urgent or as bad of a problem as it might at first appear. Such as the fact that the pretraining step of a large model, for example, only needs to be done once, and then it can be fine-tuned relatively cheaply or even used in a zero-shot fashion by millions or billions of people. So I think it would be strange to look only at the training side and not also the inference side. I mean, your question wasn't about one or the other specifically, but I would just flag the difference between the two. Take, say, Google using BERT for search result ranking, for example. Presumably almost all of the compute cost there is on the inference side rather than the training side; the original training could be a few hundred thousand dollars or something, and then it's serving billions of queries. That's not to say that it's not an issue in terms of the environmental impacts, but you also have to look at the whole product and think about inference, whereas I think a lot of the attention so far has been on things... Yeah, I guess that's a good point. I wonder why that is. I actually felt like, when I did my back-of-the-envelope calculations on the whole thing, it seemed to me that even if you took all the graphics cards that are made and ran them all the time, it wouldn't be near the environmental impact of regular data centers. But I guess the trend line is certainly scary, right? Because it's like this exponential growth in volume of usage, and I guess maybe it feels like there's a more natural barrier on the inference because it's... I don't know, why does it feel like that? Maybe because companies are doing it. I guess it seems that there's some limits to the inference, whereas the training seems to be skyrocketing. At least that's my impression. Yeah. I think that's right, in that we haven't seen that many of these models being served in production yet. And generally there's a lot of optimization pressure to keep costs down there. Whereas in training and research, it's a bit more like anything goes, trying out things to see what works. It'll be interesting to see how that evolves over time. Another thing I'd add is that I would also try not to paint all of AI and ML with a broad brush, in that, depending on the use case, you could actually be saving energy. Like DeepMind and Google using deep RL for controlling data center energy consumption, for example, is a case where you're actually able to reduce the net amount of energy used by applying some AI upfront. One question I asked Jack, and I thought his answers were interesting and I'm pretty curious about yours, is the whole thing about not releasing the biggest GPT-2 model. But honestly, here's what I thought about it, and I didn't even tell Jack this. This is my impression. I don't view myself as the expert on this stuff, but first, openness seems really, really important to me, it's a core value, and if people are going to do stuff and call themselves OpenAI, they really should be erring on the side of making their work public. But then I thought, well, it's kind of interesting that they've chosen to think about the impact of releasing this and take a controversial stance here, and also I thought, I wonder, maybe they're right. It certainly seemed to me at the time that a really powerful language model could be used in bad ways. 
And so I think I didn't feel so sure of myself, about what I thought. And then it seemed like it didn't really matter, because other models came out about a month or two later, and it almost seemed like maybe the most surprising thing is that there weren't more applications of such impressive-looking technology. I don't know why I started. But I'm kind of curious from the inside about how it felt for you and if there were any lessons you learned from it. Yeah. So from the inside, we also felt very unsure what to think, and we tried to, at each stage, say clearly, ""Well, we don't know how dangerous this is, here's the information we have, and we're trying to shrink the error bars over time in terms of both beneficial and harmful uses."" That's not to say that we eventually converged on an overall conclusion of, ""OK, this is definitely good for society,"" but we started with a default of openness, and these concerns arose in terms of people building proofs of concept of generating fake reviews for Amazon, for example, and that seemed pretty scary. Writing fake news stories seemed pretty scary... And ultimately what we ended up doing was taking an incremental approach to releasing progressively larger versions of the model, and obviously, if we could go back in time, we would take all the insights that we have now and feed them into an earlier stage in the process. What we ended up doing, I think, was a reasonable approach in the sense of, if there's a potentially irreversible decision, like releasing a model, it makes sense to be a little bit cautious before you do it if there are ways you can gather more information. I think there are ways you can get some information by doing things like human studies. And we worked with outside researchers who had people rate outputs and statistically studied the differences across the model sizes, and that informed some of our decision making. But ultimately, it's really hard to answer the question. For me, it's about the economics of it and the motivations of bad actors. So I think it's an ongoing issue that you can't really fully resolve. Do you feel like you really got new information that informed decisions along the way? What kind of information did you collect, and what different information could you have gotten that would have made you make a different choice? As a concrete example, we were very unsure what the scaling relationship is between model sizes in this GPT-2 regime of 125 million to 1.5 billion parameters. We weren't sure what the relationship was between model size and convincingness, or ability to be coherent, and clarity and so forth. We had a rough sense that there was this smoothish relationship, you know, as you grow in model size, it takes fewer and fewer tries to get given results. So less cherry picking is required for a given fixed level of performance. And that seemed to be true, but we weren't sure, OK, for a given level of cherry picking, what's the delta between model sizes. And what we ultimately found was that there actually wasn't a huge difference between the two largest model sizes, and that was one of the factors that pushed us towards releasing the 1.5 billion parameter model. Whereas if there had been more of a gap between the two, it would have felt like there was more risk in doing that release. And there were also other things happening, like other people releasing models, and then we were able to do some comparison between them. So we were trying to absorb as much information and generate as much information as possible as well. 
But overall, that's probably one of the most clear-cut cases, where the diminishing marginal risk as you increase model size was a reason why we felt that, for scientific reproducibility reasons and other reasons, the benefits were outweighing the costs, because there wasn't much increase in risk but we were also seeing significantly improved performance on the standard NLP benchmarks. So it seemed like it was a non-trivial increase in utility from a research perspective and also would allow people to start grappling with some of the issues involved in training these large models. It also didn't seem to be a huge additional risk. As someone who worked on natural language processing a while ago, in my view, the GPT-2 results were incredibly impressive. And I thought at the time, this must've been about a year ago, that the applications would be enormous. And I think actually the applications have been subtle. Like I've noticed translation systems working a little better than they used to. And there is that crazy adventure game that I've played and, you know, it's kind of fun, and I've seen people suggesting plausible domain names for your business. For example, on our website, we see a lot of models come through, so we do see people using the technology. But I don't think that my mother has noticed a difference in her world. Maybe that's not surprising in retrospect, but it's funny, because it seemed like this huge leap in my head. And I feel like the vision stuff we may be feeling a little more. I feel like face recognition feels a little more ubiquitous to me. It's scary. At least I notice my camera somehow finds people's faces and things like that that it definitely couldn't do a few years ago. What do you think about this? I tend to think in terms of general purpose technologies that could be misused or could be used for beneficial things. So I'm basically saying the same thing that I was saying about the risks. So we have some information about the fact that you can produce coherent text in some contexts, and that seems like it could be used for lots of commercial and creative applications, and also some malicious applications. But we might just not be at the level of performance for either of those domains where it is a straightforward replacement of what people were already doing. I think we'll get there eventually, and language models will continue to proliferate to a bunch of different applications, including on the edge, in the cloud, and in all sorts of contexts. But I think a few things need to happen - there needs to be more reliability and higher performance compared to humans on some tasks, where it's just not going to make sense to replace a human or augment a human if it hasn't yet reached that level of performance. And generally, I think we also need to figure out what the right workflows and ways of leveraging these systems look like. Because I don't think it's just replacing a human with a language model; I think that's one of the more naive uses that you can do, and that depends on a very high level of performance and the right kind of space where you don't need online monitoring. But I think there are also other cases, like a writing assistant, where the fact that it's not 100% reliable is not a deal breaker. And if you're able to get humans in the loop to provide feedback on these systems and choose among a few initial outputs, both for beneficial and malicious purposes, I think that could be a game changer. 
So I think we'll see further progress both on the technology and on people finding better ways to use it. Interesting. Does OpenAI continue to push things forward, or are you like, ""OK, we made this model. We're good."" How do you guys think about that? We're certainly continuing to push things forward, both in terms of trying to understand these models better, like the ones we've already built, and also trying to push further in terms of improving performance. So this is the question that I've always had about OpenAI. If you don't mind, I'm kind of curious. OpenAI's mission is something along the lines of ethical AI, am I right? Remind me. Yeah. The shorthand version is: build artificial general intelligence and use it to help everyone. It seems like a funny mix of policy and building. Do you think it's important that those two things are together? How would you argue against this? Because I think that they should go together, but I'm trying to think from the other point of view. Do policy and building really necessarily need to be combined? Is it even combined in most cases? Because it feels like the policymakers, in general, aren't always engineers. Right? Why have a thing that combines both? I think by default, a lot of people who are building powerful technologies are 'de facto' policymakers, in that they're setting the defaults and how people think of things, and they're influencing what the early applications are and so forth. So I think you can't totally separate them. I think anyone who's involved in technology should be thinking to some extent about the social impact. And, you know, there's also value to division of labor, and that's why not every single person in the organization is on the policy team. We have various different teams. So it makes sense to have some specialization. But I think the reason we think it's such a high priority is that we don't see the impact of AI as reducible exclusively to the design of the technology. It's also about what sorts of policies are there to constrain the use of it, and are there ways of distributing the economic upsides of it so that it's not only benefiting a small number of people. So we think that it's not just a matter of building the technology but also making sure that it happens at the right time, in the right environment, with the right infrastructure. So that's why we invest a lot in advocating for policy changes that we think will enable a more cooperative, more responsible AI ecosystem. I'm sure you have a group of AI ethicists or policymakers that you hang out with. I'm curious about those circles. The things I hear from people that are interested in AI don't seem very controversial to me. It seems like people want fairness, and they want openness and transparency. And these things all seem like sensible, wonderful things to me. But I guess where there are disagreements, are there different factions, you think, of people thinking deeply about this topic? There definitely are. I'm not sure I'd say factions, because I try to build bridges and stuff like that, but there are definitely differences in emphasis between people who are focused on immediate ethical concerns around things like bias and surveillance, which are on one end of the spectrum, or multiple spectra, but that's one end of one spectrum.
And on the other end are people who think that existential risk from AI systems that are too powerful to control, where unless you've thought really hard about it in advance you should by default expect things to go badly, is the thing you should focus on and devote 100% of your attention to. I think personally, I find myself somewhere in the middle, and OpenAI as an organization finds itself somewhere in the middle, in that we are thinking about both bias and long-term safety, and thinking about both present systems and AGI. And I think, in fact, ultimately both sides are just trying to figure out how to make sure this technology helps people and doesn't hurt people. And in fact, often, the conclusions in terms of the actual policy you recommend aren't that different, depending on whether you're focused on the immediate term or the long term. I see it more as a spectrum, but there are definitely people with different emphases. I feel like if the ethics is interesting, there must be controversy. Have there been decisions internal to OpenAI, or even philosophical discussions, where people really take strong and different stances? Because I think when you put it in this general sense, it's like, who would argue that we shouldn't make AI safe, and who would argue that we should make AI biased? But I have a feeling when you zoom in on what that really means, there must be points where they come into conflict. Yeah. Oh yeah. And to be clear, you had asked me about factions, so I tried to give you a map of factions, but I think there are also, within the factions, all sorts of debates, and I know that there aren't consensus positions. So, yeah, like I said, let's take the short-term issues. I think among the people who are focused on short-term issues, there is another spectrum of people who on the one end are thinking, ""We have to figure out how to do this right and in each case figure out what the right norms are,"" and they see it as being both intrinsically important and symbolically important to get issues like bias and so forth sorted out as soon as possible. They're like hardcore, like, ""let's make sure that we're not causing harm"". Maybe the Hippocratic Oath end of the spectrum; first do no harm, and take stuff like bias super seriously. And there are also lots of people who are building systems, who are focused on getting products to market, and they're focused on releasing systems that can inform research. And in the case of GPT-2, there was potentially a tension between avoiding causing harm on the one hand and enabling people to understand our research and verify our claims and build new systems on the other. And so you could maybe say, like, the Hippocratic Oath end of the spectrum, and then the Move Fast and Break Things end of the spectrum. I think there's an element of truth in both of them; not harming people on the one hand, and on the other that this technology needs to be iterated on, and there needs to be publication and models getting out there in the world in order for there to be learning by doing and ultimately figuring out how to solve those problems. So, yeah, definitely. You could see conflicts there. And in the case of GPT-2 there were definitely a bunch of different perspectives internally at OpenAI, and ultimately we tried to wrangle that into a consensus view.
But you can see the ambiguity in the fact that we were hedging a lot of our claims, like, ""We're not sure how dangerous this is,"" ""We're going one step at a time,"" because there were actually different competing values at stake. I wouldn't say totally zero-sum, but there was some zero-sumness between avoiding harm, if it turned out that GPT-2 was very dangerous, and allowing people to verify that type of claim. Yeah, that totally makes sense. I think you mentioned that you don't really have a good sense for what all the policies would need to be... Unlike climate change, where it's maybe more clear what the sensible policies are. Are there some policies that you would enact if you had control, or if you could recommend a few things to, say, the U.S. government? What would those things be? Yeah, so some of the stuff that we flagged in the Toward Trustworthy AI Development report I would definitely consider here. So stuff like compute for academia. I think generally a super high-level policy change that I'd like to see is more robust support of academic research, including not just compute, but also things like data and just funding for academics so that they're not constantly writing grant applications. I think there are lots of inefficiencies in the way that the American academic system works currently. So, like, more long-term funding of work in areas like security, surveillance, and privacy would be good, more support for working on things like bias, because I think we're currently in a regime where there's a little bit of funding here from the Department of Defense, there's a little bit of funding there from the National Science Foundation, and then there's everything that industry is doing. What I'd like to see is a world in which an academic or postdoc or something like that doesn't see a big tradeoff between work in industry versus academia in terms of the resources that they'll have and the amount of freedom that they'll have, because I think we'll see both faster progress in AI and more creativity if people are able to think long term in different sectors and not be constantly fighting over money. But also there are areas that by necessity need to be worked on over the long term. Things like AI safety, for example. You would want it to always be an option for ambitious grad students to work on that, but it's actually not really the case today. A lot of the time today, one could go to grad school and not find an easy way to work on AI safety, because a lot of the grants available are framed around defense on X, Y and Z or something like that. They do fund some safety stuff, but I would like to see more balance between the civilian and defense side, and also more robust long-term funding for academia. That makes sense. What about regulation? Besides more funding, would you want our government to enact some laws now, putting guardrails around AI research or deployments? Or would you want them to wait and see until there's more information? Yeah. It's a good question. I think my answer is yes, they should do some stuff now, and in other areas, they should wait and see. So I think in areas like driverless cars, for example, there's already a clear reason to act quickly and develop sector-specific regulations. I think another area where I'd like to see more progress in developing clear guidance for entrepreneurs and others is health applications of AI.
So there is some effort to figure out how AI systems should flow through the FDA, for example, and what should that review process look like? I think that's an area where it would be beneficial to see more investment in capacity and expertise in the government, so that they have a better ability to process these applications. And also, clearer structures for getting AI systems through a strict regulatory process, that I think could be very valuable. I don't know exactly what the details of that should look like, but, you know, generally that's something where we do not want people putting out health applications that are causing harm. But we also don't want zero health applications from AI. So I think that's an area where some guardrails could give people long-term clarity, so it's not just a black box of, ""Oh, will my system be able to be deployed in like a year or two?"", but instead there are long-term signals. I think that would be very valuable. Yeah, that makes sense. Cool. Well, we always end with two questions. I'm curious as to what you'll say about these. The first one is kind of broad, but off the top of your head, when you think about all these topics in machine learning, what's the one that you think people don't talk about enough? What's the underrated topic that you'd like to push people to go maybe learn a little more about? I think one super interesting thing is detection of language model outputs. And I mean, this is personally me being biased from working on GPT-2 a lot, but there's actually a ton of super interesting research. Things like how the sampling strategy relates to detectability, how model size relates to detectability, and another area is how fine-tuning relates to all that. So you can imagine a world in which GPT-2 or other systems are used to generate homework. And this was... Someone was just giving me an example of this on Twitter earlier today that I find funny. So it's like people are using it to cheat or to generate phishing emails or something like that. I think it's a really interesting question, what the limiting case of detection of language model outputs is. Like will it just always be...? So like, who wins that arms race? [Chuckles] Yeah. Who wins an arms race. And also, are there steps that you can take to make it more winnable from the defender's perspective? Like watermarking and things like that. Hopefully it's not a super urgent issue, and hopefully there aren't that many people actively trying to weaponize these or whatever, but I think there's been a ton of work by Google, the Allen Institute for AI, the University of Washington, OpenAI and elsewhere, on trying to push the state of the art forward. But we still don't have a general theory of how all these things fit together. Interesting idea. What is the state of the art? Can you generally detect these models? Yes. So the state of the art is around 95%-ish detection accuracy. We use a RoBERTa model, actually, in order to detect GPT-2. So when we released our newest and best system, we were actually using a smaller model to detect the outputs from the larger model. So that's one of the things; one of the early findings from folks at the University of Washington and the Allen Institute was that models can detect themselves, and this was an argument for releasing. We later found that stepping sideways and taking a different model and then using it to detect the original one works even better. One of the things that we found is that it's easier to detect our smaller models.
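As an illustrative aside, the sketch below shows what running a RoBERTa-based detector over candidate text can look like. It assumes the publicly released GPT-2 output detector is available on the Hugging Face hub under the name roberta-base-openai-detector; it is not OpenAI's exact evaluation pipeline, and the sample texts are made up.

```python
# Rough sketch of RoBERTa-based detection of generated text, assuming the
# released GPT-2 output detector is published on the Hugging Face hub as
# 'roberta-base-openai-detector'. This is an illustration, not OpenAI's
# exact setup.
from transformers import pipeline

detector = pipeline('text-classification', model='roberta-base-openai-detector')

samples = [
    'The committee will meet on Tuesday to review the budget proposal.',
    'In a shocking finding, scientists discovered a herd of unicorns living in a remote valley.',
]

for text in samples:
    result = detector(text)[0]
    # The classifier returns a label (human-written vs. model-generated) and a confidence score.
    print(result['label'], round(result['score'], 3), text[:50])
```

As the conversation notes below, longer passages tend to be easier to classify than short snippets, so in practice a detector like this is more reliable over a few paragraphs than over a single tweet-length text.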
Maybe not surprising, because they're worse in some respects and they might be making more errors that are catchable, but, yeah, that's what our initial findings were then. But then other people found other really interesting things, like what are humans picking up on versus what are AI systems detecting. Like, AI systems can detect weird stuff like the distribution of adverbs versus nouns or something, like, ""Oh, therefore it's fake."" But humans are not looking for those kinds of things. They're looking for, ""Is it saying something that's incoherent, or is it repeating itself?"" Things like that. So that's another interesting finding; that humans and machines are complementary in terms of how they detect things. I guess that makes sense. Maybe I'm overconfident or out of date, but I feel like I can still detect these models pretty reliably by noticing that they make no sense. Yes. So, there's a good game that you can play that contrasts fake versus real Trump tweets, and there are a few other quizzes like that that I think are worth trying and sometimes are harder than you might think. At least, like, in the context of fine-tuning. I guess the Twitter genre is really kind of pushing us to... Yeah. I can see Twitter being a tough medium to detect human versus machine. I feel like over a few paragraphs, I think I could do it. But now I really want to try. Yeah. The over-a-few-paragraphs part is critical, because that's one of the other robust findings; it's actually kind of good in that respect that Twitter recently went from 140 to 280 characters. (Laughs) That's a sweet spot in terms of detectability. It makes it a bit easier. Well, maybe our popular culture is nudging us more towards the language of machines even as machines are learning ours. (Both Laugh) Okay, last question. I guess this is kinda more for practitioners, but I think actually I'd be curious about your take on it. When you look at things at OpenAI or elsewhere, and you look from the conception to the creation and deployment of a complete model, where do you see the bottlenecks happening? What's the hardest piece of getting something done and out the door? Good question. It seems like a lot of it is finding the right hyperparameters, the kind of stuff that y'all are trying to help out with. Obviously, there's compute as a bottleneck, you know, if you don't have enough compute, but I'm thinking in the context where you have a decent amount of compute. Data is definitely always an uphill battle. Like, even if you have good engineers who are good at gathering and cleaning data, there's always room for improvement. So I'd say I'm not sure that I'd call it a bottleneck, but something to push on is the quality of data. Yeah. Also, related to the hyperparameter thing is ML weirdness. It's hard to debug ML systems, and there are weird silent failures, which are kinda related, and also weird silent issues in the data as well. All of those lead to various weird dynamics. I do want to say for the record that I hope that our product helps people more with actually the kind of collaboration that you're talking about, not just specifically tuning the hyperparameters, so I think both are important. But I'm really curious, actually, because we watched, from the sidelines, from far away, OpenAI try to build the robot hand that manipulated the Rubik's Cube, and just from casually talking to folks on the team, it seemed like people felt like you were really close, and then it took about two years for...
That actually got done, but over a long period of time. What happens in that year or two? From your perspective, what's going on? It can't just be tweaking hyperparameters, can it? I should emphasize I'm not a technical person, but it's not just slowly tweaking the hyperparameters. I know, in a way I'm just in need of a clearer perspective, because I don't know what they're doing and am just watching from the outside. I'm curious what you'll say. In the case of robotics, from my perspective, it felt like a fairly smooth trajectory where every few months there would be some kind of demo that seemed a bit more impressive. It wasn't like nothing happened for years. It was like maybe we didn't solve the original problem for a while. But it seemed like there was always some area to push on. And I think you mentioned collaboration; I would say just knowing what sort of techniques to apply is another part of it. So it's not the hyperparameters per se, but knowing how to get transformers to work in some new context. I think it's non-obvious, and the fact that there's not always sufficient information in papers to reproduce things sort of requires you to do a lot of trial and error. And that's just how ML research seems to work. It's like trying lots of hyperparameters, and it just takes time. Thank you so much for taking the time to talk. We'll put some of the papers you mentioned in the show notes. That was really fun. I really appreciate it. Definitely.",9527 +Hamel Husain — Building Machine Learning Tools,https://www.youtube.com/watch?v=TMe8xz4cUKs,2165,2020-06-24,"So now you have this really rich record of everything in the PR that's associated with that PR. And it's getting closer to proper software engineering practices, where you have the test that's automatically conducted from the PR itself and you have all the documentation there. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. I love talking to Hamel Husain because he likes making tools for machine learning practitioners the same way I do. He's currently working at GitHub, but before that he built large data science teams at Airbnb, DataRobot, and a whole bunch of other really successful companies. I can't wait to talk to him. Hamel, thanks so much for taking the time to talk to us today. You are the first guest that we've had that I've actually worked with before. So we have a lot to talk about, and I thought the cool thing that we worked on, which I would just love for you to describe, is the CodeSearchNet project that you spearheaded. Can you describe it for somebody who doesn't know what it is, and what the goals are? So GitHub has a large corpus of code, as you might imagine. You know, there's all these open source repositories in many different languages. And in the machine learning community, natural language processing is a really exciting space, especially with deep learning. There's a few datasets that people really like for that, but one that hasn't gotten much attention is perhaps a large corpus of GitHub data. And the thing is that the data is open, it's already open source, but the barrier to entry is kind of high, especially when you're talking about how to tokenize code or parse code. It's very complicated, especially when you strip out comments or do something like that.
So internally, with the GitHub project, we wanted to explore representation learning of code, to see if we could learn a representation of code. Sorry. What exactly do you mean by a representation of the code? Like some abstract representation, or how do you think about that? Oh yeah. Sorry. I mean it in the very canonical machine learning sense, like learning an embedding of code that aligns with natural language. That was one of the experiments that we wanted to try, to then see if we could use that to boost search results. So someone types in a query... You know, a lot of people don't like GitHub search, understandably, and, you know, if you're trying to search for code, right now it's keyword search, so you have to have a good idea of what the syntax is or what keywords may be in the code that you're trying to find. But what if you don't know that? What if you're trying to search for some kind of concept in code? Is that possible? And so we started exploring that, to see if it would be possible, perhaps, to use machine learning to learn some kind of embedding of code and then do some kind of semantic search. So, one of the interesting parts about that is you might wonder how you would go about doing that. How would you go about learning some embedding of code? And so we stole a lot of ideas from natural language processing, you know? It is useful in natural language processing if you have a parallel corpus of, let's say, one language to another language, like in language translation. So we thought about that and said, that's interesting. I wonder if we can do that with code, since there is a lot of natural language. It happens to be inside code, and specifically comments of code, where people naturally are sort of labeling what code does. And so now this is a tough problem, because comments can be everywhere, but not necessarily in the same place or at the same level of granularity or in the same format. And so what we did is we scoped down the problem to methods and functions in various languages, and looked at the docstrings, or what is equivalent to a docstring in Python, which is some comment that documents what that function is doing. So, we constructed a large parallel corpus of all of these things, and we did some experimentation, and it was a really exciting project. We had to stop in the middle for various reasons, as sometimes happens with machine learning projects that are ambitious like this. But in doing so, we thought we should open source the data and we should present the results we have and give it to the community so they can take it forward from there. So the CodeSearchNet challenge is this large parallel corpus of code and natural language, plus benchmarks of our attempt at doing information retrieval: given a query of some kind, can you identify the piece of code that goes with it? In this case, if you are given a docstring, can you find the code that it was originally paired with? So, the benchmark is an information retrieval task, and we thought, ""okay, even though we're pausing on this for a moment, we know that everybody is interested in this"". Not everybody, but a lot of people may be interested in this. We saw various research labs exploring similar problems, so we thought, being GitHub, we should get in the mix.
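As an illustrative aside, the sketch below wires up the idea Hamel describes on a tiny scale: pull (docstring, function) pairs out of Python source with the ast module, embed the code with a generic off-the-shelf text encoder, and answer a natural-language query by cosine similarity. The encoder name all-MiniLM-L6-v2 is a stand-in, not the model or pipeline the CodeSearchNet project actually used, and the two example functions are made up.

```python
# Miniature version of a docstring/code parallel corpus plus embedding search.
# Requires Python 3.9+ (for ast.unparse) and the sentence-transformers package.
import ast
from sentence_transformers import SentenceTransformer

source = '''
def binary_search(items, target):
    'Return the index of target in a sorted list, or -1 if absent.'
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

def slugify(text):
    'Lowercase a string and replace spaces with hyphens for use in URLs.'
    return text.strip().lower().replace(' ', '-')
'''

# Build the (docstring, source) pairs, one per documented function.
pairs = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef) and ast.get_docstring(node):
        pairs.append((ast.get_docstring(node), ast.unparse(node)))

encoder = SentenceTransformer('all-MiniLM-L6-v2')  # stand-in encoder
code_vectors = encoder.encode([code for _, code in pairs], normalize_embeddings=True)

query = 'find an element in a sorted array'
query_vector = encoder.encode([query], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = code_vectors @ query_vector
print(pairs[int(scores.argmax())][1])
```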
And so that was the kind of impetus to release this dataset, and of course, we had a wonderful partnership with Weights and Biases. I've heard of them. [laughs] They host the benchmark and the leaderboard of the different submissions that people have made and the improvements to the models. And I think what I really like about Weights and Biases is the transparency. So, like, in Kaggle competitions, you only really get to see what is behind the scenes if the author chooses to release the code, but with Weights and Biases you can see all kinds of detail. It's very transparent and you get very rich logs of what happened during the training process. So that's really helpful, and I think the whole community can see that. And I think that helps drive things forward. So just to clarify, the challenge is to find the code that best matches the docstring. Yeah. So there's a couple of different tasks. There's one task, matching a docstring to code, which is sort of a proxy, because the search query may not look like a docstring. It probably doesn't, really. And so that task is comparing the docstring with the original code, which is a proxy for search. So we also have some actual search queries that people have done against Bing; since we are in Microsoft, we can pull that. And, you know, we have some ways of finding out things like what page they landed on and whether that was the thing they were looking for. So we have another test set that is actual queries from a search engine. Even that is not perfect, because the task that we would want is to simulate a more scoped search of code, not a global search like on Google or Bing of how to do something. I think that's a solved problem. That works really well, at least for me. So I would say the task isn't perfect, but we did whatever we could in the time we had. I think it's awesome that you released the data. Since we worked together, I think I understand it. It's like, I search for some algorithm, maybe like insertion sort, and then it maps that to code, but it doesn't actually search keywords, it searches in an embedding space, like the embedding of insertion sort. Right? Yeah, absolutely. Like, if the code doesn't contain the word insertion or sort, you can still find that code, because it does the magic. Yeah, it does. And thus the idea is to enhance discoverability of code, to do various things. It's super cool. We'll put a link to materials so folks can find that if they're more interested. So we actually haven't talked about this, but I was looking at your background and seeing that you worked at DataRobot and you wrote about AutoML. This is another question we get all the time. What do you think is the state of AutoML? Is it something you use in practice? Would you recommend it to people? When is it useful and when is it not useful? I think AutoML is really misunderstood a lot of the time. Maybe define it first, because we may be talking about different things. I think we're talking about the same thing, but I'll define it just for clarity. So I would say AutoML is tools that help you automate your machine learning pipeline as much as possible. And that can mean various different things. Obviously, that definition has a very large scope. I mean, of course, you can automate some parts of your machine learning pipeline, or you can try to automate the whole machine learning pipeline. You know, what part are you automating exactly? Totally. That kind of gets into the weeds. But I would say something that sufficiently automates a lot of it.
You know, you can kind of bucket that into AutoML, but even that is... I mean, I don't know if there's like an official... There's an organization that has a definition. I don't remember it off the top of my head, but that's the way I would define it personally. So, you're right. I did work at DataRobot, which is a company that has a piece of software as a service, and they're one of the first folks to really put AutoML out there. I would say before DataRobot, tools that you may have seen are, I think it was called Weka or Auto-WEKA, and then maybe RapidMiner was one that was popular; things that sort of automate the machine learning pipeline. So drilling in a little bit more, what DataRobot does is you feed it a prepared set of data, so you've already done tons of work before getting to this process and you have a target variable and all these features, and then it tries a lot of feature engineering steps, almost like problem-agnostic feature engineering steps, and it tries many different algorithms, all from open source. It benchmarks them against each other and does lots of diagnostics, like an incredible amount of diagnostics on your data, and then gives you a leaderboard of all these different models. Again, they're all from open source, and that's one flavor of it; there's other people who do this. So, like, H2O has a product where they have, I think they call it AutoML or self-driving ML or something like that. I don't know what it is exactly, but they do something like this. The reason why AutoML is misunderstood is people think of it in a certain way, put it in a box and they say, ""this can't replace a data scientist"", or the objection is, ""why would I? I can beat this thing. I have domain knowledge. Why would I want to use an AutoML system when I can build a model with my domain understanding to fit the needs better?"" That's the common misunderstanding. I wrote about this when I was at Airbnb, the blog you're referencing, and I think the way that it's used most effectively is to really augment a data scientist. You may not use any of the models produced by the AutoML system, which kind of sounds ironic, but really an AutoML system gives you a lot of information from the very beginning. So I think it's really important to have a baseline, and the better your baseline, the better off you are. And so you can use an AutoML system to give you a pretty competitive baseline to begin with. The reason that a lot of people use linear regression or some simple model or just the average as, you know, a baseline or whatever it might be, is that it's easy to do, and you need a baseline to compare against what's going on. So, that's helpful, and then also you get a lot of diagnostics, you get a lot of things, you know, something that automatically explores your dataset that you can read. You're just getting more information about the task at hand. If you do that, you can use that information really effectively to go and build your own custom model and start with some more hypotheses about what might work or what might not work, or invalidate some hypotheses. I mean, it's not uncommon to hear, I hear this all the time, you know, data scientists will say, you know what, random forests, they don't really work on this problem or on this dataset or whatever. Neural nets, they don't work. But, you know, what if the AutoML system produced much better results than you did and used that model? That's really interesting.
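As an illustrative aside, the loop below sketches the leaderboard-of-baselines idea in its most minimal form with scikit-learn. Real AutoML products layer feature engineering recipes, stacking, and far richer diagnostics on top of something like this, and the built-in dataset is only a stand-in for your own prepared table.

```python
# Minimal 'try several algorithms, benchmark them, print a leaderboard' loop.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for a prepared dataset

candidates = {
    'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    'random_forest': RandomForestClassifier(n_estimators=200, random_state=0),
    'gradient_boosting': GradientBoostingClassifier(random_state=0),
}

# Cross-validated AUC for each candidate, best first.
leaderboard = sorted(
    ((cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean(), name)
     for name, model in candidates.items()),
    reverse=True,
)

for auc, name in leaderboard:
    print(f'{auc:.3f}  {name}')
```

Even if you never ship any of these models, the ranking itself is the useful output: it tells you which family of models your data seems to favor before you invest in a custom one.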
Like, why did that happen? That happens a fair number of times. When I was at Airbnb, which you know has a lot of talented people, it was really interesting; sometimes the AutoML system would give you some result that you didn't expect. And so, I think it's just really interesting. It's a way of using AutoML, and I don't really see AutoML replacing data scientists, but I see it as an incredibly useful tool. I mean, even just in doing lots of exploratory data analysis. I know that sounds trivial or easy, but just to have something that does that really nicely for you and gives you all kinds of statistics and metrics and all kinds of graphs, for free. It is just a head start. That's my spiel on AutoML. It's interesting. Do you make a distinction between AutoML and hyperparameter optimization? I think hyperparameter optimization is part of AutoML. Like, if something is really AutoML, it should also be doing hyperparameter optimization, and that's what a lot of these AutoML frameworks do. But I wouldn't say hyperparameter tuning on its own is what I would call AutoML. I mean, this word AutoML is, like I said, very hard to define, but it's definitely something that should be included in there. But what extra stuff does it do? Like try different algorithms, you mean, or what does it do on top of that? So a lot of the magic of, for example, DataRobot (because I'm really familiar with that product) is that it does a lot of feature engineering. It's built by people that are grandmasters at Kaggle. They have about three or four or actually more former grandmasters, and then a lot more people close to that. They've taken all their experience of winning these competitions and put a lot of recipes in there. So, things like, ""If you're using a tree-based model, here are some feature engineering steps to try"", ""If your data contains text, let's try these feature engineering steps"", ""If your data looks like this, let's try something else"". It has a lot of these rules built in, but it also still has a lot of different recipes included, things like model stacking, ensembling models, and, you know, it includes hyperparameter tuning. And it does an incredible amount of diagnostics, so you get feature importance type of stuff in many different ways, you get a lot of model explainability stuff. And so all that information is pretty useful, regardless of what model you're going to go with, to understand something really fast. I feel like a lot of people think of AutoML as a way to get the best model... And yeah, it's funny, I start to tell people that... I also share your view that it's a good way to do exploration, or at least a hyperparameter search. I think it's a great way to understand the domain you're in, but I guess maybe it's because Google has that automated product where you actually don't get a lot of detail, I think you just get the best model out of it, that sometimes I think of AutoML specifically as just a way to find the best model. But, of course, if you get to see all the different runs and how they did, that would be incredibly useful to learn about your dataset. Yeah, I agree, and yes, some products, I think, maybe they've designed it in such a way where they think of it as a black box and just give you the best model. I don't really think that's incredibly useful. Why isn't it useful though? I mean, because I think from a business owner's perspective, they might be like, ""Awesome, it's a pain in the ass to make models.
Let me just get done with this."" Well, I would say it is still useful. I agree with you, but it's not as useful. I mean, I would say being able to see everything and learn from it is a lot of value added, and you can still build it and not look at it, all that stuff, if you want to, but I find it to be useful as a data scientist. I found that it makes me a lot better, you know, and helps to check some stuff. When you were working at DataRobot, do you think most teams were using DataRobot more for an exploratory use case or for an optimization purpose? It was pretty mixed. So I worked with a lot of customers there, and people were using it to sort of... So a lot of people already had a model in production of some kind. And, you know, they just wanted to... and that was really an excellent use case, I mean, plug your data into DataRobot and see what's going on. And that was a really popular use case. Another one was: okay, you don't know how to get started. You're taking a long time to build a model, you want to use this... A lot of people actually used the product to help learn about data science, because it's so transparent. They could see all the steps, like you did, and they would learn about it, like the workflow. But I think there was a mix between taking something you already have, putting it in there, and exploring. I would say people that already had, like, data scientists, experienced ones, they found it really useful to get new ideas they didn't consider before. Makes sense. All right. So Hamel, what are you working on now? What's your day like? Can I guess? I've just followed you on Twitter, I haven't talked to you in a while. It seems like you are really into GitHub Actions, which I don't totally understand, but I want to learn. Oh, yeah, that's a great question. Yes, I am. I mean, I didn't try to be, but it just worked out that way. So you're absolutely right. I kind of became the GitHub Actions person for data science by accident, I think. To answer your question, I'm really interested in tools that can help data scientists, and building those. That's why I like talking to you so much, because you're also interested in that. Totally. That's what I was doing, why I was interested in working at DataRobot, and I did a lot of that at Airbnb, and then at GitHub I'm doing that also. And so one of the things that I've tried to do is to find ways in the short term, there's some long term things I'm working on which we're exploring, but in the short term, what can I do right now with the products that GitHub already has to make data science easier? And so it's being creative and trying to figure out what I can provide. So, like, there's a couple of examples of that. One is CICD: continuous integration and delivery for machine learning workflows. So GitHub launched Actions, as you alluded to. When I saw that launch, I realized that there is an opportunity to construct CICD plugins that will allow people to have machine learning workflows, or to have CICD workflows for machine learning. Just 'cause I think a lot of people might not know, including myself, what is a GitHub Action and why is it useful for this? Yeah, that's a good question. On the surface, GitHub Actions looks like another CICD system, like Travis or CircleCI or something like that. You know, compute that you can run, triggered by some GitHub event. And this runs on GitHub? Yeah, this runs on GitHub. But the way that it's differentiated is in a couple of ways. One is, you can fire a GitHub Action to run on any event, almost any event.
So an event means opening an issue, commenting on the issue, opening a PR, labeling a PR; think of almost anything that can happen on GitHub, and you can trigger actions to run arbitrary code based on that event. And then Actions is special for two primary reasons. One is, all the metadata associated with the event that triggered the action is hydrated into the action's environment. So, if you want to know who commented on that PR or whatever, it's super easy to do, because it's available inside the environment. Secondly, let's say you create a workflow that is super useful, something like, ""I'm going to run a machine learning workflow, log my metrics to Weights and Biases."" And then you report the metrics back into the PR in GitHub. That's pretty useful. And so if you want to, you can package it up a little bit and you can say, ""Okay, I have this workflow."" It expects this input. It expects a run ID, and as an output, you know, it will comment on the issue with this formatted table or something like that. I'm simplifying it, but really, I mean, we can talk about that. And then I can just use your workflow. I don't have to know anything about how you did that. I can just say, ""Hey, this action and this workflow is pretty cool, I just have to feed in a run ID or something like it and it will do the same thing on my repo."" So, I can just reference your packaged workflow from your repo. I don't have to install anything or do anything, and so I can compose many different workflows together that do very modular things. Sorry. Dumb question. So, how do I pass in the run ID, like where would that happen? So with every step in an Actions workflow, if you are using a packaged action, there's input. That input can come from anywhere; you can hard-code a string or you can say that input comes from another step in the workflow. Specifically for Weights and Biases, because I love them so much, I made an action that does this. So what happens is I actually log the SHA, the commit SHA that is associated with a machine learning workflow, to Weights and Biases, and then when the model finishes training, it takes that SHA and pulls it from Weights and Biases, so that becomes the input. That's one way of doing it. And then another thing I do is when I deploy a model. So Weights and Biases puts a comment in my PR, a table of all the different runs and the run IDs from Weights and Biases. And then I just have a chat command. I say, ""backslash deploy run ID."" And then in Actions, I parse that command and I say, ""Okay, give me the run ID,"" and I pass it into another action that takes that run ID as an input, and then it goes out to Weights and Biases, downloads the model, and pushes it to my serving infrastructure. Well, Hamel, I didn't know you did that. Can we give that to other customers? Yeah, we really should. Yeah, it's super cool. I think it's something I'm really excited about actually. Nice, it's something I would love to play with. So, basically GitHub Actions lets you build developer workflows, and you're using it to do essentially CICD for ML. Let me paint a clearer picture. Imagine a scenario you may see all the time. You open a PR, you want to make a change to a model of some kind. Happens all the time. And you want a way to be able to see if this model is better than the baseline or whatever you might have in production. Now, how do people do that today?
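As an illustrative aside, the snippet below sketches the backslash-deploy flow Hamel just described: a workflow triggered by a PR or issue comment reads the event payload that GitHub Actions exposes through the GITHUB_EVENT_PATH environment variable, parses a /deploy run-ID command, and pulls that run's model file with the public wandb API. It is not Hamel's actual action; the entity and project names and the model.pt filename are placeholders, and the hand-off to serving is left out.

```python
# Rough sketch of a chat-ops deploy step inside a GitHub Actions job.
import json
import os

import wandb

# GitHub Actions writes the triggering event's payload to this path.
with open(os.environ['GITHUB_EVENT_PATH']) as f:
    event = json.load(f)

comment = event['comment']['body'].strip()
if comment.startswith('/deploy'):
    run_id = comment.split()[1]
    api = wandb.Api()
    run = api.run(f'my-entity/my-project/{run_id}')  # placeholder entity/project
    run.file('model.pt').download(replace=True)      # placeholder artifact name
    print(f'Fetched model from run {run_id}; hand it off to serving from here.')
```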
Even if you have something really cool like Weights and Biases, that's still a separate system from GitHub. And you might have to go out to Weights and Biases and copy and paste the link into the PR and say, ""hey, I ran this code there,"" but that's prone to error. Like, that might be a stale run. You might have changed the code since then. You never know. And it's all in different places. You have to go back to the PR, and, you know, that's a manual process. It's not a good practice. In machine learning, workflows can take a long time to run, from a day to a week, whatever it might be, but sometimes you might want to do a full run before you merge some code. And so with GitHub Actions what you can do is you can say, in your PR, ""I'm ready to test a full run of this model,"" and then you can issue a chat-ops command, say ""run full test"", and then your model runs on the infrastructure of your choice and logs to the experiment tracking system of your choice. And then it dumps metadata into the PR in the form of comments and other things, where you can see all the results and diagnostics of this model that you want. And then finally, you can decide to deploy this model or do anything else using another chat command, and then deploy it. And so now you have this really rich record of everything in the PR that's associated with that PR. And it's getting closer to proper software engineering practices, where you have the test that's automatically conducted from the PR itself and you have all the documentation there. I even talked to some Weights and Biases customers. What they do is they have Jupyter notebooks on the side and they copy and paste their Jupyter notebooks into the PRs. And, you know, this is to avoid all that stuff. So it just happens magically. I hope that helps. That was great. That was educational. And I think we should try to package what you did and offer it to customers, they would like it. Yeah. Yeah. All right. Well, you know, we've been talking for like half an hour. So I should wrap up with some final questions that we've been asking everybody and getting really interesting answers to. What is one underrated aspect of machine learning that you think people should pay more attention to? I actually think one of the highest impact areas you can try to work on as a data scientist or in machine learning is to build tools for other data scientists. And sometimes the tools don't have to be sexy. So there's another thing that I built called Fast Pages, which helps people share information and write blogs, which sounds really unsexy and it really is very unsexy. But I think that kind of stuff is very useful. So thinking of ways to automate your own workflow and then share that and package that into tools, I think, is very underrated. Nice, from one tool maker to another, I have a lot of respect for that. That's awesome. We'll put a link to Fast Pages, I think it is super cool. All right. Next question is, what's the biggest challenge of making machine learning actually work in the real world, in your experience? I think that there is a gap between software engineering and machine learning, and kind of different disciplines need to work together, a lot of times, to pull off a successful deployment of machine learning. And I think part of it is organizational, part of it is tooling. I don't think tooling can get you all the way there.
I think machine learning is a team sport and requires people from design and UX, of course, to ML folks, infrastructure people, DevOps, all kinds of people to pull something off. And so I think that can be a challenge, to get those people together in a lot of organizations to do what you want. Have there been any particular dysfunctional patterns that you've seen over and over with that, or miscommunications? Yeah, yeah. I mean, I think the main pattern that I continue to see over and over again is, ""Oh, my business is in trouble. Let's sprinkle some data scientists on it."" You don't just hire data scientists to solve a machine learning problem. You need to think about it as a holistic product. And I think that pattern keeps repeating itself over and over again. You're still seeing that? Reminds me of my first job. Yeah. It may not have changed much. I don't know. I mean, hopefully the industry is getting a little bit better. All right. The final question is, if people want to learn more about what you're working on and get in touch with you, what's the best way for people to find you online? Yeah, I mean, there's a lot of ways you can find me; on Twitter, I have a website that I haven't updated in a while, but you can do that. Just Google me. There's a lot of stuff like that. Thanks so much for chatting. It was really fun. Yeah. Thank you. Cheers.",5088 +Vicki Boykis — Machine Learning Across Industries,https://www.youtube.com/watch?v=pOnRSYSNuXI,2042,2020-06-03,"you'll come to someone and they say, 'OK, well, I want to figure out customer churn.' And you're like, 'OK, we'll build this model but I can't guarantee that it's going to be good. I can't guarantee it's going to be accurate in the first pass.' But in the meantime, you have to figure out how long you're gonna be at the client, how much value you're going to add. So it's very, very hazy. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. Vicki Boykis is a senior consultant in machine learning and engineering, and she works with very large clients and brings a really interesting perspective on how they think about data science and machine learning. She also has a hilarious Twitter account and a really fascinating newsletter that we will link to, and I'm really excited to talk to her today. It's really nice to talk to you, Vicki. Could you tell me a little bit about your career and how you got into data science and where that began? Yeah, it's been a really interesting career, I think. Not unlike a lot of other people's, but a little bit different. So I don't have a computer science background; I did an undergrad in economics at Penn State and then I went into economic consulting, which is pretty unusual, and it was right around the time of the recession in 2007. So I was happy to find a job, actually, and even more so to find one in my industry. But that involved doing a lot of spreadsheets, tracking global trade movements, tracking internal projects, all that kind of stuff. I started working with data analytics there, and then for my next couple of jobs I worked in data analytics, and then at a job that I had at Comcast in Philadelphia, I started working with big data. We had big data available, and there I started working with Hadoop. What year was that? That was 2012. Cool. Yeah.
So right around the time that it was starting to get really big, I started working with that tool stack, and that involved me having to get a lot more technical. At the time I was primarily doing SQL, and I had some frustrations with just doing SQL with Hadoop. Hive was still relatively new, a lot of growing pains there. So I started working with that stack, and then from there I started doing more large-scale data science, sampling, programming, all of that, and then went on to my next job, a data science job. And I've been doing data science ever since. But ironically enough, I'm moving away from data science now, and I actually wrote a post about this. I think the entire industry is moving a little bit in that direction. So not every job, but the industry on average, more towards instrumenting the processes around data science. So creating machine learning pipelines, creating foundations and structures that are really solid and that go end to end. And so I think there's still a ton of jobs that are just pure analysis, but as the industry grows, as the amount of data that we work with grows, I think the industry as a whole is trying to get smarter about replicability. And that's where I'm working now, more in the machine learning engineering space. So you think the bigger challenges now are becoming more engineering issues than analysis issues? Yeah, I think I agree with that, at least from my perspective. So, again, I'm a consultant. I come into companies that want to build out data science or data engineering platforms, and usually they're starting from a question about, let's say, ""are sales going up or down and why?"" And then we work backwards and say, ""OK, well, you actually don't have the data to do this yet and you don't have a platform set up where you can reliably look at this stuff on a month-to-month basis."" So that's where a lot of the challenges that I see are now. Interesting. Is it maybe because you're going to companies now that are a little bit farther behind, is that possible? Or companies starting from scratch? Or do you think something's changing where people expect to have more built-out processes and tools? I think it's the second one. I think if you look at the data tools landscape map that Matt Turck puts out every year, in 2011 and 2012 it was about a quarter of a page and it was just, like, Hadoop. And that was it. Now I think poor Matt has to put together about five hundred logos on a single page, and there's an orchestration area, and there's a tools area, and there's an area just for tools around Spark and all that stuff. So, I think people also now have an expectation that whatever you have should be productionizable, even to the point where we now have notebooks, which are generally seen as an exploration tool, and there's also some movement there. For example, Netflix recently has had some work around productionizing notebooks. So, whatever workflow you're looking at, I think there's the expectation that in the end it has to be reproducible to be valuable. I see. It's funny, you know? My last company, we sold into a lot of the Fortune 500, not necessarily Silicon Valley. I always really enjoyed seeing the different perspectives and all the different applications. How has it been for you as a consultant? Does it feel more frustrating or exciting? Or what is it like to go into an organization and then try to teach them how to build up this process? I think it's a little bit of both.
So it's interesting, because consulting as a data scientist involves both, and I think this is actually true of all data science, but even more so with consulting. It involves both the people piece and the technical piece. So, you have to know what you're doing technically, because you're the expert when you come into the company and you have to say, ""OK, this is how we want to do the architecture"". You are also going to be talking to people who maybe don't want this process at all. You're going to be talking to people who are disorganized. You're going to be talking to people who are for it but don't necessarily understand it. And so a lot of that work is actually talking to people and building the case for this stuff as well. What does a typical stack end up looking like these days? Oh, it's hard to say. So I've dealt with companies both small and large, and a lot of companies are increasingly in the cloud. So it's interesting. I don't think I have any GCP clients that I've dealt with. AWS is, of course, probably the lead. And I did a Twitter question about this a couple of months ago, like, who's using what? AWS, I think, came out at something around 60 to 70 percent. Azure, I'm surprised, is really catching up. I think even as little as two to three years ago, they were squarely in third place, no one was even considering them. But now it's really growing, and I think part of that is Microsoft's leadership, plus the fact that a lot of companies in the retail space are not allowed to use AWS because they see them as a competitor, and partially because they are stepping up their game in the tools that they're providing. Is there a particular offering from Azure that you like and you think is driving some of this growth? Ironically, I actually haven't used Azure a lot. Most of my work has been in AWS, but now that I'm seeing that people are more interested in it, I'm definitely going to have to start looking into it. Interesting. Is there any tool that you think is underrated, that people probably should be using or you'd recommend, that people aren't using yet? I want to say Bash. That's a really glib answer. But it's really true, because a lot of the time when you come into these big, huge projects, you have five or six different AWS services spun up. You have GPUs, you have monitoring, you have all this stuff. And then you start thinking, ""Okay, well, I have all this stuff, how am I going to use it? Well, oh, I can't test this locally, I can't do this locally, I can't sample the data, what am I going to do with it?"" So I really do think, and I find myself falling into this pattern too, where you use all this big data stuff, but then you don't use the stuff that you have available to you. And it's even easier these days, when a lot of us are working with pretty high-powered machines, so you can do a lot locally as well. Interesting. So you run a lot of stuff locally? Some, yeah. Especially to test stuff, to prototype, and in cloud environments it's really hard to spin up those local environments. So just even to look at the data, to examine what you're dealing with, all of that stuff you can do locally, and you can use Bash. Bash goes a long way towards that. So I know that you can't talk about your individual customers, but can you talk broadly about the questions that are driving more interest in data science right now? Like, what's top of mind?
What do you expect, going into another Fortune 500 company, that the executives want out of their data science platform that they're not getting right now? The number one question is always to understand customers, understand what they're doing, and understand how what the customers are doing ties directly to the bottom line. And that manifests itself in a number of different ways. The one that I usually talk about, which I've also written a blog post about, is churn. Everybody always wants to know churn. So how many people are leaving, why are they leaving your platform, and how much money is it going to cost you on a month-to-month basis? Everybody always wants to know that, and I can guarantee in any given project they'll ask about it. And then the second one is better understanding operational metrics. There's sometimes not a lot of insight into that. And the third one would probably be classifying customers into different types of customers. Interesting. What's a deliverable you would give a company around churn that they'd be excited about? Can you literally say, I can predict churn X percent better, or is it, if you see this signal then that means churn? How do you actually present the analysis? Yes. A deliverable would be literally a platform that has information to be able to predict what churn is going to be, for example, for next month. Usually what ends up happening is a lot of the things that I'll deliver are the data engineering piece around getting all the data together in one place, so we can have a data lake, so we can actually deliver that churn piece. And how sophisticated would a churn prediction model be today? Are people using deep learning for this, or how complicated do these models get? I don't think they are. I think a lot of times, companies, and even before I was doing consulting, in all my previous jobs, people are just impressed if you can get a model out the door. Also a lot of the time in the industry overall. So, if you have something that you can benchmark against, it's seen as good, especially because there are so many steps in doing it. So first you have to collect the data, then you have to clean the data. Then you have to go to the customer support team and say, 'does someone calling in mean that the person might churn or not?' And then you have to collect all the manual data that they keep and keep track of that. Then you have to build the model, then you have to do a prediction, then you have to meet with the people who are in charge of this and explain your data to them. And then there's going to be a back and forth there. And then you have to productionalize all of that. So if you can get a model going end-to-end, and I've come into companies where there was zero data science before... And that's why I'm saying that you have to build it from the ground up. Having that is fantastic. And just having metrics where there were no metrics before is a huge step up. And then the next step up is, of course, 'Okay, well, why is this metric different this month? Why is this metric different that month?' So, a lot of the churn models I've built have been with pretty basic stuff like logistic regression and decision trees. I haven't seen any deep learning used for churn yet, but I'm sure that use is just around the corner. Okay, here's a specific question. I mean, decision trees versus logistic regression, they do different things. Do you have a particular one you start with, or do you try both, or some kind of hybrid? So again, it depends on the data available.
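As an illustrative aside before Vicki continues, here is a minimal sketch of the two churn baselines she mentions: a logistic regression for a clean churn probability, and a shallow decision tree whose rules can be printed and walked through with stakeholders. The data here is synthetic stand-in data, not a real churn dataset, and the feature names are placeholders.

```python
# Minimal churn-baseline comparison: logistic regression vs. a shallow decision tree.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic, imbalanced stand-in for a churn table (most customers do not churn).
X, y = make_classification(n_samples=5000, n_features=8, weights=[0.85], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

print('logistic regression AUC:', roc_auc_score(y_test, logreg.predict_proba(X_test)[:, 1]))
print('decision tree AUC:', roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))

# The tree's selling point for executives: its decision rules are directly readable.
print(export_text(tree, feature_names=[f'feature_{i}' for i in range(8)]))
```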
Usually, it also depends on who's going to be looking at it. If it's people at a higher level, like executives that need to briefly glance at something and understand it immediately, a decision tree is very intuitive and very easy to explain, and it can offer a number of different pathways for discussion. If you just need some sort of model that spits out, 'Is this person going to churn, yes or no?', logistic regression is a little bit better for that. But again, it depends on the stack that they have. There are different software packages that are better or worse for logistic regression. So, for example, surprisingly, Python, as far as I know, does not have very good decision tree support. You could do XGBoost, which is not quite the same. That's like multiple decision trees, right? Nested decision trees. Gradient boosted trees, yeah — but it doesn't offer the nice visual interpretation, I guess, as much as the R packages do. So, yeah, it really depends on what you have available, what you can do, all that kind of stuff. But I would say any of those three are my go-to tools for that. Interesting. So you'll build a stable pipeline that includes R in it? I've done it for stuff where I've had to prototype and throw it out. I actually have not built an R pipeline in production, although I know it's very possible and increasingly becoming more and more possible. Interesting. So do you feel like R is here to stay, or do you feel like it's getting replaced by Python? What do you think? I think they're two different tools for two different things. R is fantastic for statistics, for stuff that you're working on in probably smaller teams, and Python is more of a general tool: if you need to glue stuff together, if you need to do deep learning, that kind of stuff, you'll use Python. But its basic statistical capabilities are not as good as a lot of the R packages. How do you think about leaving your work in a state where another person can update it? How does that happen? Do you ever check back in with a client and see if anyone's touched your model and whether it's still useful for them? That seems like it must be really hard. What we usually do is work side by side with the client. So we'll have a person on the client side who is a data scientist, or we'll have teams, and we do education throughout the process so we can hand it off — and I can be confident that this person knows how to pick it up and knows how it was built. I see. That makes sense, and you probably pick the technologies they're familiar with or... For sure. We try to pick technologies that are not foreign to the client, so it's not like they're completely floundering and lost when we hand over PyTorch or something. So what's the biggest frustration in this whole process? Where do you see the biggest room for potential improvement? I mean, we've both sold into big companies and it's challenging. You don't want to say bad things about your clients, but do you see any patterns there? I think the biggest issue is trying to explain the benefit of machine learning when it's not always exactly clear. So for example, you'll come to someone and they say, 'OK, well, I want to figure out customer churn.' And you're like, 'OK, we'll build this model, but I can't guarantee that it's going to be good. I can't guarantee it's going to be accurate on the first pass.' 
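Purely as an illustration of the decision-tree interpretability point made earlier in this exchange — scikit-learn does ship decision trees, and its text export gets part of the way toward the readable output R users are used to — here is a tiny sketch on synthetic stand-in data.

```python
# Fit a shallow decision tree and print its rules as plain text, which is the
# kind of artifact an executive can actually read through; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=['tenure', 'support_calls', 'spend']))
```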
But in the meantime, you have to figure out how long you're gonna be at the client, how much value you're going to add. So it's very, very hazy. And I think that's more of a frustration for me. But it's also an educational issue where you're not going to always get to a right answer, like the first sprint or the second sprint. It's going to be an iterative process and sometimes, if you add stuff, the model get worse. If you take stuff away, the models get better. So it's kind of hard because data science is always sold or rather I see it being sold as this exact thing, but it's very much like an art process. And so I think that's where some of the frustration is. It's not an exact thing and people expect it to be. And I can imagine it's probably really hard going in, like apparently not knowing the amount of lift. Somebody is like, I for sure want this to get better. So it's like, well without the data, how would you know? How do you articulate that? If someone's like, ""Hey, tell me how much you know you're going to improve my churn prediction"", what would you say to that? First I don't know, I've actually never had it happen that someone was like you have to improve my model by this much. It's usually like let's create a model to do X, Y or Z. But what we usually do is benchmark against previous metrics that they have. And so the goal there is to say, 'look, we're not sure how much we can improve your model, but we can improve the process around the models so that it can be a little clearer.' When you look at the successful engagements where you feel like you really made a difference versus the ones that are more frustrating, are there patterns your more successful clients are exhibiting around data science that sets them up for success? Usually working in a tight loop with me. So a lot of the times the companies I work at will be bigger. And so the data science team will be on one side. The data engineering team will be on one side. The project management team will be somewhere over there. And so I'll talk to all of them. But they don't talk to each other necessarily. And so what I've seen work best is when I'm embedded with a developer, a data scientist, a project manager, that are all kind of working together towards the same thing because there's a big tendency to get silo-ed. So I think companies debate internally about ""Should we have a single data science function or should we embed the data scientists and have the different kind of functional teams, hire data scientists for the individual products that they're working on?"" Do you have an opinion? it sounds like you might prefer data science being embedded in specific products that are specific outcomes, or do you think that it's better to keep it all as a single function so you can hire better people or create a better culture? I'm not really sure I have an opinion on that. I've seen it work well, different ways in different companies. I think probably for smaller companies, I would say less than a thousand people or so, you probably want to have a centralized team and for much, much larger companies, you probably want to have embedded data science teams. But then the danger is, if you don't manage them centrally, then you have five or six data science teams working on the same questions. And I've definitely seen this at companies where it's just replicated work and they're just approaching it in different ways. So you really have seen success in both. Yeah. Yeah, for sure. I see. 
Do you see specific stages where you're prototyping something and then deploying it into production? It sounds like you're really focused on getting things stable and in production, but do you prototype the steps first and then solidify them? Yeah. You simply say to a client, ""we're gonna prototype and then we're gonna deploy it""? Yeah, that's usually what we do. Usually I come into an environment and it's not really clear what's going on in the environment at all. You're just kind of thrown in and told, okay, go. So the first step is to gather and assess what's going on. What tools are they using? Who are the key people involved in this? Gather all of that and start to create a model from whatever data you have available. See if you can actually create that model, and then, many sprints later, take that model to production. It's usually never that you come in, you create something, and then it's already running. There are usually a lot of human steps in the middle to get it to that point. I feel like everyone always underestimates the pain of taking a prototype into production. What are the biggest challenges that people don't usually expect going into that process? Packaging the model is always a big one. How do you package it? How do you typically package it? What are the options? You could pickle it. You could create a REST endpoint for it. You could put the model in a Docker container and expose endpoints from it. I think that's something that I've seen happen more and more frequently, where the resulting output is essentially a web app or a web service, and something hits that web service and you get an inference back. I would say those are the two big ways right now. I think another big thing that people don't think about a lot is metadata management, and a lot of big companies want to do metadata management. In fact, I think almost every company that I've talked to over the last five years has said we need some way to manage all the metadata in the data lake so that we can update the models and so the analysts can do the analysis. But there's no single tool for it, and I think only now are open source tools starting to come out for it. Like, was it Uber that came out with Amundsen? I forget. But there are at least a couple of companies that have a metadata management system. As for what the metadata is: which variables are in the model, when was this model updated, when was this table updated, all that kind of stuff. And surprisingly, people actually clamor for that, more so than even visibility into how to manage the model. I was kind of curious what you were going to say the metadata is, but it sounds like you're giving examples of both: there's metadata about the actual input data, and there's also metadata about what the model is actually doing. Sounds like both are important. Yeah. I was just going to say the biggest one is usually that people create data lakes. They throw everything into unstructured environments like S3, and then they need to understand what's actually going into those environments and where it's coming from, which is where the metadata piece comes in. And what kinds of trouble do people run into from not having standardized metadata? What are the issues that come up? Well, they wouldn't know, for example, which tables they can use for what, or when those tables are being updated. A big thing for big companies is whether that data is proprietary or not, whether they're actually allowed to use it. 
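As a hedged sketch of the second packaging option mentioned above — a pickled model wrapped in a small web service that could then live in a Docker container — something like the following Flask app would do. The file name, route, and feature payload are made up for illustration.

```python
# Load a pickled model once at startup and expose a single prediction endpoint.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open('churn_model.pkl', 'rb') as f:      # hypothetical pickled model
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    payload = request.get_json()              # e.g. {'features': [12, 3, 49.9]}
    prob = model.predict_proba([payload['features']])[0][1]
    return jsonify({'churn_probability': float(prob)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
```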
There are all sorts of controls around PII and all kinds of stuff. And then usually data lake analysts will also want to query it, and they won't know what's in there at all. So it's another way to surface it so that it doesn't impact production. When analysts are hitting it, for example, they don't hit the entire Redshift table or the entire thing in BigQuery. They know what the data is, what's available, what they can take from it and what they can't. And so are most of the models that you're building running in online mode, or do they run offline in batches? A mix. I would say most of the models that we've built for clients are online — I'm sorry, are batch. I'm working on a personal project now that's online. Cool, can you say what it is? Yeah, I'm actually almost done with it. I'm working with GPT-2 — it's like a Medium think piece generator. That sounds too dangerous to release. Yes. So the idea is that you put the first few sentences of a VC blog post in there and it generates a Medium think piece for you. So hopefully that'll be online. But my inference time is five minutes right now, so maybe it won't. We'll see. Do you do any monitoring of these models? Is that an issue? Because so many people talk about how the input data changes and nobody notices, and then the model gets broken and nobody notices. Is that a real issue that you've seen? I think we're just starting on this as an industry. I know there's a lot of talk about observability and catching model drift, and some of the larger companies are really ahead in that space. In general, I would say it's very much an unresolved issue, and people usually still resort to checking the database and making sure that the data going in is OK — that's the level of checking where we are. And I think people are just starting to say, ""OK, well, this is where the model was yesterday. This is where it is today, and this is where it should be tomorrow."" Gotcha. I have a question I've been dying to ask you that we really haven't talked about yet. I love working with people that know Bash, because I'm always embarrassed by my own Bash skills — I feel like I've always been too lazy to learn it properly. Do you have a favorite Bash command that people might not know? Is that your favorite? No, I didn't say it was my favorite. I said it was an overlooked tool. I don't consider myself a Bash guru by any means. You know, something you learned, like, in the last year and you're like... xargs, yeah. It lets you do parallel processing of a lot of stuff, so you can run processes simultaneously. Let's see. cat is one that I use a lot. cat and uniq basically let you do a COUNT(*) like you would in a database situation. I would say those are my most commonly used ones. That's cool. Well, we always end with a couple of questions — I mean, we have touched on some of these — but I'd be really curious to hear your perspective. What's an underrated aspect of machine learning that you think people don't pay enough attention to? I think I touched on this earlier, but the people part of machine learning. If you are able to get more data or better data from people, it's always going to go better than banging your head against trying to figure out a more advanced model. It's interesting. A lot of people have said that; that seems like a trend. All right. So you've also touched on this a fair amount, but I'm curious how you would synthesize it. What is the biggest challenge of making machine learning work in the real world right now? 
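Just to make the think-piece project above concrete — this is not her code, and she mentions a five-minute inference time that an off-the-shelf setup would not necessarily reproduce — the same idea can be sketched with Hugging Face's stock GPT-2 pipeline and a made-up prompt:

```python
# Sketch of prompt-in, think-piece-out with stock GPT-2; sampling on a CPU
# can be slow, which echoes the inference-time problem described above.
from transformers import pipeline

generator = pipeline('text-generation', model='gpt2')
prompt = 'Every founder I meet lately is asking the same question about data.'
result = generator(prompt, max_length=200, num_return_sequences=1)
print(result[0]['generated_text'])
```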
Putting stuff in production. Putting stuff in production — and what is it about that? What's the hardest part of putting stuff in production? Because there's so much that you need to get right in order to make it work. It's not just a software system. Well, it's just as complicated as one, but even more so, because with software, you have a piece of code, you put it in Docker, you put it somewhere and it goes. With this, you have to keep track of data that's flowing in from Kafka or Kinesis or some other stream. You have to make sure that all of that data is correct. You have to make sure it's serialized in the right format. You have to make sure that the database the data is streaming into processes it correctly. You have to check all that data. Then you create your models. Your models might work one day; you might get drift the next day, so you have to plan for that. And like I said, I think we're still in the early stages of planning for that. Then you have to expose your model to some service or some endpoint that's going to consume it. The model piece itself you have to put somewhere, like Docker or whatever, and you have to make sure to orchestrate all of that. So this is very similar to software, except that, like in modern software development, we now have a lot of pieces of the stack that we're responsible for because of DevOps. DevOps, in theory, is supposed to make things easier for you, but what it means is that the software developer now also has to be a sysadmin and understand some of those pieces — and the cloud brought in the fact that you also now have to be a network expert. So actually, a lot of my issues are troubleshooting, like, 'why can't this service connect to this service over the company firewall?', basically. So there's all of that. And you have to know the data, and you have to know how the model that you're creating works. Putting all of that together in production is really hard. And so I would say that's the biggest thing. Well said. I have a feeling a lot of people listening to this podcast are going to want to hire you. So if that's the case, where can people find you? What's the best way to reach out? I'd say, well, maybe it's Twitter, where you're absolutely hilarious. Yeah, Twitter is the best way. I'm just @vboykis. And I also write a newsletter called Normcore Tech about all this kind of stuff — data and a lot more. I can vouch for the newsletter — I've been a subscriber for six months, since we met, and I definitely enjoy it. It's an honor to talk to you today. Thank you. Thank you for having me.",5345 
+Angela & Danielle — Designing ML Models for Millions of Consumer Robots,https://www.youtube.com/watch?v=W55uO4gIlQ4,3158,2020-05-05,"If it doesn't actually solve a real problem that a real human is willing to part money with in order to have that problem solved for them, it doesn't matter how sophisticated it is or how much data you have if it's not solving the right problem. You're listening to Gradient Dissent, a show where we learn about making models work in the real world. I'm your host, Lukas Biewald. Today, I'm talking to Angela Bassa and Danielle Dean. Angela is an expert in building and leading data teams, and she's the Director of Data Science at iRobot. Danielle is the technical director of machine learning at iRobot, was previously at Microsoft, and has a PhD in quantitative psychology from UNC. So, Danielle and Angela, you're both super technical, but also managers and leaders of teams. 
How do you approach building a technical team, how you got started and what have you learned as you build teams. It's really nice to be here talking with you guys. At iRobot, we've had an interesting evolution of how we approach machine learning within the organization. And it's really quite significant to have this partnership because Danielle is just amazing. And I really, really, really love working with her. The two of us have really good complementary styles. My background is a bit more towards theoretical and applied Math and I grew up in modeling of systems; so modeling soybean trade introgression in agricultural settings, modeling epidemiology processes within certain geographies or disease spaces. All of that is very distinct from robotics but it turns out that a lot of the tooling, and I have to bring in agriculture again, but there is a lot of cross pollination that's very useful. But Danielle also has a really different background that I'll let her talk about which gives robustness to how we approach these problems. I'm super excited to be here today, and I thank Angela for the introduction. It's been great working at iRobot and thinking about building the machine learning organization. I think one thing that's been really great is thinking about how we bring different perspectives and different people into the skillset. So people from computer science background, people from statistics background, people from geology, biology and chemistry who have had really different experiences. It makes thinking about building a team that can solve real world problems for our customers really exciting. So for example, when you two work together, can you think of a time that you brought different approaches to a problem? Or how would you explain the difference between the way you think through things? I grew up working in investment banking and strategy consulting, and then marketing organizations. So my background is a little bit unorthodox, but very real world oriented, which has been helpful. And one anecdote that I like to highlight to this end isn't my personal one, it is another leader on our team; Theresa Borcuch who manages the data science team and has a marine biology background. When we have been thinking about how is it that we analyze data that's coming from teaming missions, we have robots that can work in tandem. There's the mopping robot and the vacuum robot. So you can have the vacuum robot your room and then go to a different place. And then the mopping robot comes back afterwards and completes the mission. And so she had a really interesting way of thinking about that problem because she had done a lot of research with pods of dolphins. So she's looking at the artifacts of that system because dolphins can't tell you what their intent was, they can't tell you what they are trying to do, but they are acting and through sensors and data collection, you can get the artifacts of that and then do analysis to try to to derive deeper insights. Theresa was able to bring all of that knowledge. That literature exists, right? That knowledge exists and she was able to think about it differently and really enrich the way that our team dealt with that. So we do have some bias towards real world experience and ways in which we understand that we're looking at fossils. And so when we project what the dinosaur looks like, our dinosaurs tend to not have cartilage and feathers because we wouldn't have known that just from the fossils. 
But we know that the real world is richer than what we're looking at and I think that allows us to build more solid answers to how we use these ML applications. Interesting. Do you have anything to add? That was such a good story, I just can't add to that. It's funny. I mean, just as an aside, I remember the first machine learning team I worked on, my boss, John Mark, he always said that he'd like to hire a biologist because they look at the specific examples versus. He actually was a physicist, and he's like, I hate having other physicists because they always try to look in generalities instead of being specific. And he felt like what the team really needed was more people to look at the actual training data, the specific things that were feeding into the model. So it's funny, I hadn't thought about that in fifteen years, maybe, but he always was. He was always talking about that. Yeah. And it's huge in our domain because we have these robots deployed literally in every timezone globally, all across the world. And you know, I've never lived in every place in the world so we will look at the data, we will see information, and I will construct a mental model of what might be happening. But if we don't have a team that's robust and can challenge a lot of our heuristics and baked-in assumptions and go, ""I don't think that means what we think it means""... And that enforces exactly what you're talking about, descending into the particulars and really examining what it is that the data is telling us, not what we hope it aligns with, because we have this perfect first principle's model of what it should be. So iRobot has been around for a long time building robots and I think robotics, traditionally didn't do a lot of machine learning. I mean, I think it's always obviously relevant but it wasn't machine learning first until maybe recently. I'm curious about the evolution of, if you know it, of iRobots thinking about ML. When did you start to think, ""Okay, we need to build ML models and get them in production""?. When did you start building an ML skill set at iRobot? I think it's an interesting thing about iRobot history when they first started over 30 years ago now. Almost every single part of the solution had to be made by iRobot thinking about, from the navigation stack to the hardware stack, to all the software pieces, to how the robot navigates the world and how it uses all that different sensor information. But as the field has changed a lot, and especially as the AI field has changed and deep learning has really transformed and the quality of solutions that can come from machine learning have really transformed. IRobots looked at, ""hey, there's solutions out in the industry that we can leverage on and we can improve our products using those skill sets and expertise"". So I think machine learning has been at iRobot for quite a while now, mainly started in research areas and thinking about how can we use that research, how we can start these research projects and think about how we can improve the quality of our navigation stack? how do we improve the quality of our cleaning solutions,how do we improve the quality of our robotics overall? But just recently, in the last few years, it's moved from a research stage to actually being in production solutions. Everywhere, from improving our digital experiences in the applications to improving our hardware solutions for things like navigation and cleaning experiences. So it's been interesting to see the journey. 
And I think machine learning is starting to be a bigger piece of the iRobot solutions moving forward. When do you think it first became a production thing? And what was the impetus like? What what was the application that drove it? Initially, I think the first application that we could really call ML, if not ML adjacent, really ML, is Slam. Which is the localization and the mapping of how the robot navigates its space and that uses a lot of the same methods and a lot of the same, now modern tooling. Back then, a lot of the tooling had to be invented and by iRobot too. And so I think the legacy of ML applications in production at iRobot, if you include the same component, is actually a quite longstanding one but in terms of the more typically defined ML, I think it really wasn't something that could have been as important a part of the strategy as it is right now until the robots became connected. So for a very long time, these robots were completely self-sufficient, closed systems. They came out of the manufacturing line, onto an inventory somewhere, into a customer's home, and then it just worked and there was no way to either collect any information off it or to send information to it to modify its behavior. In late 2015, when we started having the IOT connection component to the stack, that really opened the door to improved data collection, which is really the thing without which you can't have meaningful ML that actually does something that feels auto magical to the customer experience. So I think that's, from a historic perspective, where iRobot started using ML in the production environment and I also think that one thing that shapes the iRobot story quite deeply is exactly what Danielle was saying; is that because a lot of the initial tooling was built in-house, we have had a really interesting blend of standard tool choices versus the deep customization that's really part of the DNA of the company. So that's one of the things that has been really interesting over the last two years, especially once this really proved itself internally. And we've been able to demonstrate just how valuable this is, both for our internal use just to improve our own software development capability, but also to help shape where the strategic roadmap might go, as well as all the other ways in which we can use ML answers to improve the business. That's so interesting. So it was really Internet connectivity that made ML applications possible. I'm not sure if it's ""possible"", but it's what finally made the case that even if it had been possible before, it might have been prohibitive to collect enough training and validation data for these things to be robust in a way that will work on an apartment in Singapore, the same way that it will work on a ranch in Texas, right? So the ability to really collect a complete picture of what it is that these machine learning algorithms are going to be interacting with, it really didn't make sense because we wouldn't have been able to meet the magical delightful expectation that our global customers might have had. Cool. So tell me about your team, what are the different people doing? What are the relative sizes of investment and different modeling, deployment.? Are you mostly researchers? Are you mostly engineers?. How do you think about that? What are the different functions and how do they all work together? 
The team has changed a lot over the years as we've shifted from the research side — proving concepts of machine learning applications that we can do — to thinking about production applications. So we have a team that's dedicated to machine learning algorithms and models: thinking about how we develop those models, what the appropriate type of data to feed into those models is, and how we improve them over time. That team would essentially be machine learning practitioners. I mean, how would I put it — they're the ones doing the ML modeling, mostly? This is the only part of the team that would be stereotypical ML. So I'll talk a little bit more now about the other parts of the team. When you talk about a machine learning team, everybody thinks, OK, that's people doing algorithms, right? We do have some of those people, but it's like... Can you say a little bit about the relative sizes of these teams? I think it's just really interesting that you're in production and doing this stuff. Yeah. So the modeling team, out of the machine learning team all up, is actually pretty small — I think they're about a quarter of the size of the full team. The modeling team does the algorithm development, and then there are complementary teams that work beside them. Another team works on more of the integration side: how do they take the algorithms and actually deploy them to the robots? They're thinking through the supporting infrastructure and the model conversion process to run on this limited hardware. Obviously, we can't use the same types of models that you can run on big GPU machines, so they're thinking about the integration of these algorithms onto the robot. What do you call that team? Is that like MLOps? We actually call it an integration team — integrating into the robot software and the surrounding infrastructure around it. Are these people kind of hardware-ish people? They're still mostly software, but they have an understanding of the hardware too, and they can work with folks in hardware on specific applications. And one interesting part about doing machine learning applications on robots is that it's not just the ML algorithm, but how it interacts in the system as a whole. So these are the folks that really need to think through the end-to-end solution: what the metrics of the system are, how it affects the performance of the robot at the end of the day — things like cleaning coverage, we want robots to clean all parts of the house, we want to maximize these metrics from the customer's point of view, not the algorithm's point of view. So are some of these people more business focused? They're not, but they're thinking about how to capture the metrics to report to the business folks and thinking about what those metrics are. So thinking about the system-level metrics rather than just the individual algorithm metrics. And just to complement that, it's really iterative. We have an R&D organization and then a product organization, and they're sister organizations. Danielle and I are on the R&D side of the equation, and there are the teams that she's going to continue to talk about. There are the four teams within machine learning proper, and then each one of those has a counterpart within R&D that's attached to a program. 
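The episode doesn't say which conversion toolchain iRobot uses, so purely as a generic illustration of the "model conversion for limited hardware" step mentioned above, here is a minimal TensorFlow Lite sketch that shrinks a trained SavedModel with the default optimizations; the paths are placeholders.

```python
# Illustrative only: convert a trained SavedModel to a smaller TFLite artifact
# suitable for a constrained embedded target.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('exported_model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables post-training quantization
tflite_model = converter.convert()

with open('model_for_robot.tflite', 'wb') as f:
    f.write(tflite_model)
```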
So if it's a specific robot or a specific feature or a specific customer deliverable, there's somebody that's integrating that machine learning component with all of the other R&D components that are necessary to deliver the complete final box solution, either as an over the air update, an OTA or to the manufacturing frame work. On the third leg to that stool is the product team. So we have business counterparts who own that product feature deliverable from the business standpoint. They may not know what's technically feasible or they're actually incredibly robust at iRobot because we do have a bias for technologists so that tends to not truly be the case but they have these R&D partners who can help them iterate on how do we define these metrics. Do we care only about end states and lack of interstitial errors or do we care about that within the context of cleaning coverage and cleaning efficiency and all of the other low abandonment rates, all of the holistic experience of what this thing is supposed to deliver? And so that spans the gamut but for instance, on the first team that Danielle was talking about, the modeling team, the way that they talk to their business counterpart is almost similar to the way that they talk to the integration team. So they can write models and write papers and then tell the integration team that ""this is the aha, this is the thing, the solution"" but they are not as constrained with, ""What's the compute accounting? How much budget do we have to be able to run this as the robot is doing all of these other things that it actually has to do as well?"" And so it truly is a really well integrated and iterative process. It's not that they go into their nerd cave, they come up with a solution and they come out and toss it over the fence. Yeah, definitely. So after the integration team, there's a team that's focused on actually what you mentioned before. Look at the platform and operations teams that's separate from the integration team and this is the team that is helping with the infrastructure that supports both modeling and integration work. So things like, how do we scale out our compute? How do we train our models and store models and things like that? So how do you think about the different skillset between that team and the integrations team? So the integration team is a lot in our case because we're deploying on a robot, therefore they also have often C++ knowledge to be able to integrate the models into that environment. They have built system experience around the robot software. So the integration team is more focused on how do we make this stuff real on the robot versus the platform team is how do we build supporting infrastructure to support both of those other teams? And the platform team is, generally speaking, more cloud focused. We use AWS at iRobot for a majority of our applications and so how do we build serverless applications on the cloud to support those different applications. We found that having an integration team separate from the model and separate from the architecture and separate from advanced solutions is really important. Because of this, the last mile of deploying machine learning to production is really long. And we have all of the added complexity that our last mile is incredibly restricted, automated robotic sensor, autonomous platform that's deployed in environments we can't control and we can't really modify. 
So having a team dedicated to focusing on just how hard that last mile is has really paid off in terms of, yes, if it's added complexity, having a whole another team added specialization like we, that's not free, but it's definitely been a net positive. And what's the last team? So the last team is a reinforcement learning team, they specialize in reinforcement learning applications. And they work closely with the integration team and the platform team for the same applications as the modeling team, just the modelling team is focused on other applications. Like a team for researchers essentially. Yeah. Researchers and making reinforcement learning real for iRobot. Interesting. So among those teams, which is the hardest team to hire for? Oh, that's tough. Oftentimes it's these little gaps in between the teams and these things that you don't expect you need and then you find you need in going through a project, and one often overlooked part is aspects around how do we deal with data and how the data feeds into the rest of the systems. So even folks who are supporting the team in how do we use the data and how we curate the data in such a way that our models can improve and our systems can integrate. Often it is the work in between the cracks that's the hardest to figure out. How do we fill these gaps when there's no clear work that the other teams are working towards? One thing that I will note, though. Two things actually. One, we are hiring. So if you have experience in embedded systems and machine learning, toolchain, automation, let me know, we're very much hiring. But also, one thing that's amazing about iRobot is that it's actually not that hard to hire. I mean, yes, it is, because this is a very specialized skill set that we're looking for, given that we've reached a level of scale where that specialization pays off for us. But iRobot is really a cool place that's really well connected. We have a lot of really smart people working alongside us who have very broad networks. So we actually get a lot of really smart, really competent folks reaching out to us a lot, which I'm very grateful for, because as you've noted, these tend to be really hard skill sets to find. But also, everybody, even six year old gets tickled when they get to answer, ""What do you do for a living?"" ""I build robots."" That really doesn't suck. I'm curious. Do you notice that there's any challenges with cultural differences or miscommunications between these teams? I think you actually do hire fairly different people, has there been any kind of lessons learned from trying to get everyone to work together to build something? I think one really important thing in thinking about building machine learning teams, especially when you get to the point where you have different specializations and different goals of these sub teams, is thinking about where the boundaries are between teams and how handover happens. So one way that we've tried to address this at iRobot is we've built virtual teams that work across team boundaries. So for a particular application, there's somebody from the modeling team working on it, there's somebody from the integration team working there, somebody from the platform team working on it and so they're there like a virtual team that works across boundaries. Because if otherwise, there could be some gaps in handover between teams. So trying to build these integration points between the teams is really helpful in getting an end application out the door. 
Also it's really helpful to have these teams not feel like there are these arbitrary walls in between them so the virtual teaming that Danielle described is one way that we do that. But making sure that the folks get a lot of time together, discussing scientific papers, getting together and talking about recent developments, attending conferences together so that there's really this strong culture that allows for a lot of that connective tissue to happen so that the gaps don't exist. Things do fall through the cracks, there's no way to avoid that. But the fact that we know it will happen, we create a culture where folks are attentive to it and intentional in seeking those things out to catch them rather than going, ""Well, it's not my problem"". So I think the fact that Danielle is really intentional about building that kind of cohesiveness, that I'm really intentional about paying attention to that kind of cohesiveness is one of the things that has been paying off in spades for us in terms of the speed with which we can get these things to happen and not have to constantly take two steps forward, one step back of ""what didn't make it this last time?"" Is it a challenge for you to weigh the priorities of what the company needs in the models? The company needs versus paper writing and academic conferences, has that been an issue for you? Imagine..But that has been the case for 30 years; you hired the best and the brightest. You hire people with a bias towards nerdery. And obviously, all of us want to be big bad nerds, and big bad nerdery doesn't necessarily always pay the bills. So that's why having that iterative component and having a sister organization in the products team, that is well plugged into what the end goal is, because I think once we reframe how the team talks about it and we stop the dichotomy of ""are we doing cool research or are we doing boring product work?"", when we reframe it to 'are we solving a real problem?' Not an abstraction. Not a theoretical articulation of a problem. But are we actually seeing in the data that we're collecting back, the telemetry that's aggregated for that we can look at? Are we seeing qualitative impact into how humans behave and how they behave with respect to the behavior of their robot? I think getting that and celebrating that and highlighting that, packaging that and communicating that back to the team so that they get jazzed about the fact that what they're doing, that last long mile once you cross it, you can actually see a meaningful feedback. I think that gets the team really excited to be able to play that game of research and bottom line work in a much lighter way than I had been exposed to previously. That's cool. So like measuring and communicating. And having, dashboards across the office where you can see how things are moving, see how the models converging both in test, but also whenever it's deployed in production and seeing the impact on the fleet and communicating that out to everybody else in the company and hearing all of our partner nerds who nerd on different things celebrate that success. I think it just really becomes an interesting virtuous cycle. When you interview ML practitioners and those adjacent, do you interview for something a little bit different than other companies, like Microsoft or Google? Is it the same stuff or is there something different about iRobot that you look for? Good question. 
I do, but I've never hired for Microsoft or Google so I wouldn't know if I would have had some pushback but I was already at iRobot when Danielle joined. So I was really happy to welcome her to the team. But it didn't seem like there were too many fundamental differences with the way the transition happened, is that right? Yeah. Yeah. I can't think of any major differences. I think a bias towards a love of hardware helps, Yeah. I think if you don't love the fact that these things are going to live outside of your own realm, it's just not going to be as exciting. And, you know, everybody wants to work with somebody who's excited about what they're doing. I think that it does seem your kind of quality control and testing must be pretty rigorous because it's probably hard to go back and fix things. Can you say a little bit about how you approach that? Model quality control and testing is a huge part of what we do and as much as we can. A lot of these different layers we've automated so that when we're developing software, we have automated tests that run even automated tests that run on the robots. We think about it at almost every single layer of the system. So from model algorithm development, thinking about how do we validate our models off-line and even that very first step is tricky, right? Because we're building robots that run in the real world across, millions of homes, all the different variety of homes in the world. So even that first part of how do we test our models off-line to make sure that they actually will work in the real world? Even that is tricky. But after we get over that hurdle, we then think about the next step of deploying on the robot and making sure all those pieces that we need are together. So from the model conversion process to what pre-processing happens, to making sure that the hardware that goes into the machine learning algorithms is the same as the applications that are developed when we develop the machine learning model. So thinking about the variety and hardware that goes into the data collection aspects and then where the machine learning model is running, that the software is running correctly.. So metrics for off-line, metrics for on the robot, and then also once we deploy the model in production, making sure that we monitor, continue to monitor that model and continue to understand feedback and improvements so that we can also send updates to the model when necessary. So there's a lot of different layers to that. And it gets really complicated because we are thinking about how we can generalize to the real world and the real homes; and there's a lot of variety in that. There's a component to testing that goes beyond just model quality and model applicability, because these are robots that are in consumers' homes. So we also have a regulatory and a compliance component to the testing. So the Roomba, the Braava, they have a set of testing. But also, when you start thinking about things like the Tarot, which is the lawnmower, that has a completely different kind of fail stop mechanism because it's got literal blades that are cutting grass and you definitely don't want it going over fido. So we have this testing culture and mentality that goes all the way from how the hardware works, to how the software works, navigation, and then through all of this, the machine learning features that are deployed through these products, be they, you know, recommender systems or computer vision types of applications. Have there been any interesting surprises? 
Like as you've gone from these specific unit tests to more bigger picture tests of production, have there been any things where you are like 'we didn't see that coming'? I mean, I'd say there is.... It is amazing that you continue to find little aspects and you continue to realize throughout the journey how important testing is and not just for finding things, but actually helping you move more efficiently. So having checks at the different layers also helps you debug faster. It's not only knowing that something is going wrong and being proactive, but knowing something's wrong and knowing where something is wrong in the system because we have tests at each layer so that we can really hone into where the problem is. So I think that's something that I've really enjoyed seeing over time - how building in these different layers helps us move more, move faster in the long run. So they've all worked perfectly to prediction. Definitely not. We've built up tests over time. But it's really interesting because we've also learned through this process the navel gazing process of how do we test, how do we ensure and how we root cause. We also noticed that across iRobot, our team is getting called to help other teams root cause what they're going through because of the way that we interact with data. We have brilliant technologists, roboticists and algorithmcists, but they are not all trained in the gospel of data and dealing with really large data sets that require imputation that are gappy, that aren't necessarily fully representative, comprehensive of what it is that a software developer has access to when they're looking at how their IDE is responding to them. So we are able to bring that and help different teams that sometimes have nothing to do with machine learning; to help them look through the information that our robots are sending back at telemetry, whatever it may be, and then help them build tooling that looks at fleet data to help them towards their specific scenarios, even completely unrelated to machine learning altogether. How do you think about testing models that are inherently non-deterministic or at least, you can't be sure that they're going to be accurate every single time? Do you have some threshold or do you look for distribution drift? This is something that we actually think about a lot because of our systems. We also have data deletion requests that come in and obviously we comply and think about how we build in user trust and user privacy into our system. So we think about this a lot because our source data actually changes as a result of data deletion requests. And so thinking about how to build reproducible pipelines that we know, what are the components that went into that pipeline so that we can recreate training datasets based on both the distribution and based on what data went into that model originally, minus the data that was deleted or things that came into the system and new data that feeds in as well, because we also want to supplement our data sources with any new data so we can improve over time and and get better features in the end. So we think a lot about how to create reproducible pipelines, how do we track work? This is really fascinating. So when you say reproducible pipeline, but the data is changing, what exactly does that mean? It means understanding exactly what distribution we're pulling from and what sources we're pulling from and being able to reproduce the processing that pulls and creates the data set that goes into the model. 
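As a sketch of the reproducible-pipeline idea being described (not iRobot's implementation; the file layout and column names are assumptions), one simple pattern is to rebuild the training set deterministically from named sources, drop rows for users who have requested deletion, and write a manifest so the exact pull can be reproduced or audited later:

```python
# Rebuild a training set from fixed sources, excluding deleted users, and
# record a manifest (sources, row count, content hash) for reproducibility.
import hashlib
import json
import pandas as pd

def build_training_set(source_paths, deleted_ids_path, manifest_path):
    deleted = set(pd.read_csv(deleted_ids_path)['user_id'])
    frames = [pd.read_csv(p) for p in sorted(source_paths)]
    data = pd.concat(frames, ignore_index=True)
    data = data[~data['user_id'].isin(deleted)]

    manifest = {
        'sources': sorted(source_paths),
        'deleted_ids_file': deleted_ids_path,
        'row_count': int(len(data)),
        'content_hash': hashlib.sha256(
            pd.util.hash_pandas_object(data, index=False).values.tobytes()
        ).hexdigest(),
    }
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f, indent=2)
    return data
```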
So I see — the process is reproducible. Yes, exactly. But also, whenever we do reach some converged model, the interpretability and the explainability of that model — while all of the training data is still there as intended and designed — is really important, because of the ephemerality of that data. So if the underlying data and the underlying training set do change — even if we have to change things in order to be respectful and sensible stewards of our customers' data — we ensure that whenever we do have a model in hand, what it learned can be reproduced, in addition to the reproducibility of the pipeline, so that we don't have blind black boxes that we just retrain at will, because of the regulatory and compliance environment within which we operate. And for the reproducible pipelines, we also think about the metrics that come out of those. So some of the metrics we were talking about earlier — the metrics offline, the metrics on the robot — for all of those aspects we're working towards full automation, so that we can have these reproducible pipelines and reproducible metrics at each layer of the system. That's interesting. So these metrics — when you say these metrics we talked about, are you referring to the end-to-end product or business metrics, or do you mean something specific to that model? Both. So there are the model metrics that come out, as well as how that model performs in the bigger system — there could be a whole other system that it's interacting with on the robot — so you need to think through both, and we might need to actually tune the model differently in order to optimize the system metrics. So it's thinking about those different layers, being able to reproduce them, capturing metrics, and then creating those reproducible pipelines — that's how we think about approaching the problem. Interesting. And have you seen modern techniques — how much have modeling improvements improved these metrics? Has that been important to you? I mean, some people say, ""oh, it's just about the data."" Some people say, ""no, deeper, more sophisticated models really matter."" Some people think hyperparameter search is a really good idea, and for some applications it seems like it doesn't help very much. Where do you stand on those types of things? So I'd say data first. For our use cases, because we're trying to hit all of the users, all of the houses, we have to be able to generalize to the world and all the variety of the world, and making sure that our data is generalizable in that way is the most important thing. You mean the physical world. The physical world, yeah. So the threshold between one room and the other... The architecture of 1800s Germany is something that you really can't modify. Houses in Berlin will be the way that they are, and so it's really a different world. When you're talking about a robot — we're going to be improving these models, we're going to be sending new models over-the-air — but the compute that robot has available to it and the environment within which it operates are static. And so that's the constraint, which is why we value the data collection so much — because of that sort of constraint, essentially. Absolutely. And the other aspect is the algorithms that run on low compute. I think with the advancements that happen in that space, there are little things that you can do that make a big difference. 
So thinking about that area, I think there is a lot of room to go in terms of how much algorithm improvements can make a difference. Did they matter to you over the last couple of years — the stuff that's been really meaningful? Yes. Especially advances in inference time as well, because we are running in such constrained compute environments. The small differences can make a big deal. And this is beyond the hardware improvements — this is actual model improvements. Right, yep. Cool. Just being really specific, I'm curious: when you think about hyperparameter searches, or AutoML, is that something you do a lot of? Does that help you much? Yeah, we definitely leverage hyperparameter search and AutoML-type capabilities to see what the space out there is and see how we can improve. On the data side, augmentation approaches are also very useful in thinking about how we can supplement the current data that we have, to try to make sure that it is generalizable to the world. So data augmentation approaches versus model architecture — which do you think is more important? That's a tough one. Both, obviously. Fair. I mean, if we had infinite resources, I would agree. But I think, looking back at what we've actually had to overcome over the last year and a half, I'd say that data augmentation has been more of a throttle or a bottleneck for us. I'm not sure that that is always going to be the case. But specifically — it's been a problem for you or it's been helpful to you? It's been helpful. It solved the thing that was a real problem for us, a bigger problem. So, the ability to understand every environment from day zero, right? Once a robot starts going around a space, a customer — somebody who just bought it — they don't want to give that robot a month to get good. If you wait a month, they're going to stop using it, and then you lose the chance of delighting that customer. So starting off with a really solid day-zero solution was really the big, important objective for us. But I'm not sure that that is always going to be the case, which is why I'd say both, if we had infinite resources. It's early 2020 — things change, for sure. Have you focused on one deep learning framework? Is that TensorFlow or PyTorch, or do you love scikit-learn? Or is it all of the above? It depends on the team. When we're talking about the reinforcement learning team or the modeling team, I think they get more leeway in experimenting, because that's the role — is the stuff that we've settled on continuing to work for what we want, and will it pave the way for what we are going to want? So those folks test a little bit more, go a little bit more wild into the research: what's available, what the state-of-the-art improvements are, and how we can make those applicable within the environments that we operate in. When we're talking about the integration and the infrastructure teams, they're focused on different things. They are much more focused on that last mile, and so we don't re-architect all of our tooling unless there is a really important benefit that we get from that, because at that point you're impacting all of the platforms that may make use of this one deliverable, and a lot of those platforms are already in homes. So if we have to re-architect how all of that delivery happens, there are a lot of teams and there's a lot of integration that needs to take place. And so, is it worthwhile to spend those resources? 
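As a generic illustration of the data-augmentation lever discussed above — the specific transforms iRobot uses aren't mentioned, so these layer choices are assumptions — a minimal Keras augmentation pipeline on recent TensorFlow versions looks something like this:

```python
# Illustrative only: random flips, rotations, zooms and contrast changes to
# stretch an image dataset a bit further; applied here to a stand-in batch.
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(0.05),
    tf.keras.layers.RandomZoom(0.1),
    tf.keras.layers.RandomContrast(0.1),
])

images = tf.random.uniform((8, 224, 224, 3))   # fake batch of 8 RGB images
augmented = augment(images, training=True)
print(augmented.shape)
```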
The farther into the process we get, the harder it is to stay, to modify and to tinker with that stuff. But I think the modeling team. Yes. And I think they've done Tensor flow, and SciKit line is always a favorite of all. I think we've dabbled in PyTorch and figured out it isn't working for us. But once it gets down farther into the pipeline, it is much more into what's already working and 'is it worthwhile to throw a bomb in there'? Can you say what's already working? Not yet. But we look forward to coming back and sharing a more details on all of that. Cool. Can't wait. I want to be respectful of your time. I think we've gone a little over. I'm hoping to end with a couple of questions, if you don't mind. We've been asking everyone. So here's my first one; what is the one underrated aspect of machine learning that you think people should pay more attention to? Oh, the data. The data is so important for making sure that it's generalizable, making sure that actually figuring out your metrics are all going to be based on the data that you're using. And so how do you know the metrics of the model or the metrics of the system are actually metrics that you believe are useful to represent things? Because at the end of the day, it's all about the data that goes into the system. And it's not just the volume. I mean, one hundred percent. It's the data. But it's not just the volume, the variety or the velocity of it, like the fancy 3D. But it's the quality of it. So are we only looking at data that's streaming back from us? From customers who are already happy? Which means we're not solving the problem for the folks who aren't already happy to begin with. We're not solving the problems for the folks who aren't using the product that they've already paid for and we're not delighting them. So it's not just that we have data, but is that data reflective of the whole? And how do we ensure that the next thing that we do isn't just improving marginally the experience of a fraction of our customers, but yes data. 100% data. Underlined. Why do you think it's so hard for the industry to realize how it could still be underrated? Seems blindingly obvious from where I sit. I don't think that it's underrated. I think when you ask this question, you probably get the same answer. It's not like it's underrated. It's just hard. It's known to be valuable. It's also known to be really hard. They know a few companies that will collect high quality data. No doubt. So, next question. What is the biggest challenge in making machine learning actually work in the real world from where you sit? When you start a research team coming up with something and some crazy framework to actually getting it deployed, what is the hardest part of that? What makes you the most nervous? So a really tough learning for me was not at iRobot. It was at the company that I was at before. We had this amazing solution; a machine learning algorithm that was for energy efficiency and it predicted when there might be a spike on the utilization of the energy grid or something of that nature. And it worked really well. It was really fantastic. We were all proud of it and it didn't sell. And so the really important thing is communicating to the right people. What is it that they're getting and what is the value? Because I think and I was blinded by this, too. I'm not saying that I was smart and I knew it and nobody did. Nobody else did. I wasn't above it. 
I fell for the mirage of how delightful and wonderful and sexy and how performant and correct it all was. And that doesn't matter. If it doesn't actually solve a real problem that a real human is willing to part with money in order to have that problem solved for them, it doesn't matter how sophisticated it is or how much data you have, unless you're solving the right problem. Danielle, I don't know if you have a different anecdote or a different experience. So one of the biggest challenges that I see with machine learning systems in the real world, and getting them out there, is that there are a lot of different pieces that need to come together to make this right: the data collection piece, the training part, the processing, the model serving, the software pieces, the hardware pieces. In our case, the hardware changes over time, and we also need to make sure the data reflects those hardware changes over time. There are the different homes, the different customers, making sure that it's actually generalizable. So there are so many pieces, and it's making sure that you have quality checks in place on those different pieces, so that when things don't work as well, in general or for a specific customer, we have ways to make it better, so that the quality at the end of the day is what the customer is expecting. So I think that's the biggest challenge: just connecting all those pieces together and making it real. Cool. Final question: if people enjoyed this (and I bet a lot of people are really going to enjoy this) and they wanted to learn more about your work or reach out to you, is there anything you'd like to link to, or a Twitter account that you use, or anything like that? Well, first, you should definitely go to LinkedIn and you should look at the iRobot page, you should apply, you should come work with us. That's number one. And the iRobot.com website. Seems like it would be really fun to work there. It is going to be super fun. I am not at all biased, because I love the people that I work with and I love the work that we do. But that's one place: the iRobot.com website actually has a lot of things for the consumer, but it also goes a little bit into what we do and who we are. So that's useful. Personally, for me, you can go to my website; I'm at AngelaBassa.com. I'm also prolific on Twitter, which is a problem, I'm working on it! @angebassa. And I'm on Twitter as well, @DanielleODean. Awesome. Well, thanks so much. That was super fun!
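The 'quality checks on all the pieces' and 'is the data reflective of the whole' points above lend themselves to a concrete check. Here is a minimal, purely illustrative sketch (the column name, segments, and tolerance are hypothetical, not anything iRobot described) that compares the segment mix of streamed-back telemetry against the full customer fleet:

# Illustrative sketch: flag segments that look under- or over-represented
# in incoming data relative to the whole population. Names are hypothetical.
import pandas as pd

def check_representativeness(telemetry: pd.DataFrame, fleet: pd.DataFrame,
                             segment_col: str = "home_type",
                             tolerance: float = 0.10) -> pd.DataFrame:
    # Proportion of each segment in the data streaming back vs. in the full fleet
    observed = telemetry[segment_col].value_counts(normalize=True)
    expected = fleet[segment_col].value_counts(normalize=True)
    report = pd.DataFrame({"observed": observed, "expected": expected}).fillna(0.0)
    report["gap"] = (report["observed"] - report["expected"]).abs()
    report["flag"] = report["gap"] > tolerance  # segments whose share drifted too far
    return report.sort_values("gap", ascending=False)

# Toy usage: telemetry over-represents apartments relative to the fleet
telemetry = pd.DataFrame({"home_type": ["apartment"] * 8 + ["house"] * 2})
fleet = pd.DataFrame({"home_type": ["apartment"] * 5 + ["house"] * 5})
print(check_representativeness(telemetry, fleet))

The same idea extends to any segmentation you care about (hardware revision, region, usage level); the point is simply to flag when the data coming back stops looking like the population you are trying to serve.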
Thank you!",7930 +Jack Clark — Building Trustworthy AI Systems,https://www.youtube.com/watch?v=nv_f1Gk8Ybk,3356,2020-04-22,The challenge is like, "Well, I didn't sign up for this. I wanted to do AI research. I didn't want to do AI research plus societal ethics and geopolitics. That's also not my expertise." I think that's a very reasonable point. Unfortunately, there isn't another crack team of people hiding behind some wall, entirely shouldering the burden of this. You're listening to Gradient Dissent, a show where we learn about making machine learning models work in the real world. I'm your host, Lukas Biewald. Jack Clark is the strategy and communications director at OpenAI. Before that, he was the world's only neural network reporter at Bloomberg. He's also one of the smartest people I know thinking about policy and AI and ethics. I really am excited to talk to him. I feel like I typically get nervous when people ask me, you know, kind of big policy questions about AI, and I never really feel like I have much smart to say. And I think the goal of this podcast is mainly to talk about, you know, people doing AI in production. But then when I started writing down questions I wanted to ask you, I was like, "Wait a sec, I want to ask you all the policy questions and all the weird questions that everyone asks me," because I have no idea, and I seriously want to know, because I feel like you think about this a lot. I mean, this is such a cliché question, but I'm actually fascinated by how you're going to answer it, which is: what probability do you put on an AI apocalypse? Oh good, so we'll start with the really easy question and go from there. Yeah, yeah, yeah. What's yours? Is it like 1 in 10, like 9 out of 10, one in a million? What do you say? The chance of AI apocalypses is quite high, it's like 50%, but that's only because the chance of most apocalypses, if they get to the point that they're happening (like, say, a pandemic, which is something that we're currently going through), it's quite clear that most of today's governments don't have the capacity or capabilities for these really hard challenges. So if you end up in some scenario where you've got large autonomous broken machines doing bad stuff, then I think your chance of being able to right the ship is relatively low, and you don't have a super positive sort of outlook. I think the chance that we have to avert that and get ahead of it, it's actually quite hard. But I think your question is more like, if something wakes up and we enter this very, very weird territory, what are our chances? And I think if we don't do anything today, our chances are extremely poor. No, okay. So, yeah, I think maybe I agree our chance of surviving an AI apocalypse is probably low, but I think my question actually is: what do you think the chances are of actually entering the AI apocalypse? And remember that apocalypse scenarios, you know, you kind of only get one, right? So, like, I mean, in a way, a pandemic apocalypse, unless you think they're sort of linked, that should make the AI apocalypse less likely, right? I think this is kind of like, at the beginning of when you started to do massive amounts of computer trading on the stock market, asking, "What's the chance we're going to enter into a high-frequency trading apocalypse?" And I think how someone would have answered that is: it's really high that we'll have problems, but it's fairly unlikely that the whole system will topple over due to high-frequency trading.
And I think my answer for AI is pretty similar: it's really high that we're going to have some problems, because it's a massively scaled technology that does weird things really, really quickly, and it will do stuff in areas where, as with finance, huge amounts of capital are also deployed. So the opportunity for big things going wrong is there, but a total apocalypse feels a little fantastical to me. And that's partly because I think for a real apocalypse, like a really bad, severe one, you need the ability for AI to take a lot of actions in the world, which means you need robotics, and robotics, as you and I both know, is terrible, and that actually protects us from huge amounts of the more ornate apocalypse scenarios. The way that I think about this is: you develop a load of radical technology, and some of the greatest risks you face aren't the technology deciding of its own volition to do bad stuff. That very rarely happens; it's even unlikely here. There's a chance that you kind of get black-mold technology: somewhere in your house that you have not been cleaning efficiently, you don't have good systems in place, and something problematic starts developing in a kind of emergent way you barely notice, but that thing has really adverse effects on you, and it's also hard to diagnose the source of the problem. That's interesting. So, okay, that's actually a little bit less of a... that seems like a much more concrete scenario. I guess, what form might that take? I mean, it sounds like you're mostly worried about the things we're doing now, where we get better at doing these bad things and that causes big problems. Or what is top of mind as a concern for you? Yeah, well, I guess my concern is we're currently pretty blind to most of the places this could show up, and we kind of need something that looks a lot like weather forecasting, you know, radar and sensors, for looking at evolutions in this domain. The sorts of things that I'd be worried about are scaled-up versions of what we already have: recommendation systems pushing people towards increasingly odd areas of content or subject matter without us quite realizing it, quietly radicalizing people, or making people behave differently with each other. I worry quite a lot about how AI capabilities interact with economics. You have some economic incentive today to create entertaining disinformation or misinformation, and I think we should think about what happens when those things collide: you've got good AI tools for creating misinformation or disinformation, and an economic incentive, and it starts showing up. I think there are going to be relatively few grand evil plans; I think there are going to be lots and lots of accidental screw-ups that happen at really, really large scales and really, really quickly, with self-reinforcing cycles. And I think that's the challenge: you not only need to spot something, you're going to need to take action quite quickly, and that's something that we're traditionally just really, really bad at doing as people. We can observe bad things happening, but our ability to act against them is quite low. Yeah, I mean, so you do a lot of work on ethics in AI and a lot of thinking about it, but those scenarios sort of feel like... is AI special there? It seems like there's a lot that might just be general technology risk, right? Do you think AI makes it different? I think delegation. So, technology allows us to delegate certain things. Technology, up until many of these practical forms of AI, lets us delegate highly specific things that we can write down in a sort of procedural way, and AI allows us to delegate things which have a bit more inherent freedom in how you choose to approach the problem that's being delegated to you, like, you know, "make sure people watch more videos on my website." It's kind of a fuzzy problem; you're giving a larger space for the system to think about it. And so I think the ethics now, they're not something humans haven't encountered before, but it's the sort of ethics which has a lot in common with the military, or how you ran administrative states in the old days: the ethical nature of giving someone the ability to delegate increasingly broad tasks to hundreds or thousands of capable people. That's a classic ethical problem that people have dealt with for hundreds or thousands of years, but with AI, now almost everyone gets to do that delegation, and that really hasn't happened before. We haven't had this scale of delegation and this ease with which people can scale themselves up. And so lots of the ethical challenges are like: okay, people now have much greater capabilities to do good and harm than they did before; they have these automated tools that extend their ability to do this. How do you think about the role of the tool developer in that context? Because sure, you're building just iterations on previous tools, but the scope of what those tools will be used for, the areas in which they'll be used, is much, much broader than you've dealt with before, and I think it introduces ethical considerations for you that maybe governments bore previously, if that's the right word. I see. So, in your view, AI kind of allows single individuals to have a sort of broader impact, you know, and therefore the tools they actually make available to folks,
so there's more ethical issues within that yeah like a good way to think about this is oh I think language models are interesting here's an ethical challenge but I find interesting with language models you have a big language model that has a generative capability you want to give that basically everyone because it's a sort of analogous to a new form of paintbrush it's it's very general people are going to do all kinds of stuff with it except this paintbrush reflects the implicit biases of the system sort of data that it was trained on at massive scale so okay it's like a slightly racist paintbrush the problem is now different to just having a paintbrush you've got like a paintbrush that has slight tendencies and some of these tendencies seem fine to you but some of the tendencies seem to reflect things that many people have a strong moral view of as being like bad and in society what do you do then and I've actually spoken to lots of lots of artists about bearson most artists will just say give him a paintbrush like I won't be like crazy funhouse mirror version of society so I can talk about to make interesting things that feels fine but then they wonder about what happens if if someone gets given his paintbrush and they just want to write checks for a kind of economic purpose they may not know much about the paintbrush they've been given they may not know about straights and then suddenly they kind of unwittingly creating massive scaled up versions of papaya seeds and herring - that thing you gave up but that seems challenging and we're like we used a technology developers have a lot of choice a sort of uncomfortable amount of choice and a lot of problems which are not easy to like fix but you can't fix this you need to sort of bigger I haven't talked about it kimete people aware of it well it's a really clever analogy I have not not heard that one before yeah I mean I think I think it's eat some weird scalability of a lot of this stuff like if we just have tools but let people scale themselves in various directions and the directions are increasingly creative areas because we're building these you know scaled up curve fitting systems that can fit really weird curves simply to get like interesting semantic debate but all the problems of like curve fitting now become we have problems of like a production of parts and sort of force which feels different and challenging I don't have great answers here I have more like oh dear this is interesting and feels like different but actually I mean it's interesting because the like the you know you speak of this sort of like language model is like you know just for example like what if you had a language model but I mean like open a actually like had this issue you know and I'm curious like how you thought about at the time and how you reflect on that now I think at the time so this is GPT - which is a language model will be announced and didn't initially release that subsequent be released in full at the time we I think we made a kind of classic error which is that if you're developing a technology you see all of its potential very very clearly and you don't just see the technology you're holding at your hands you see gen 2 3 4 & 5 and the implications there off I think we treated some of our worries about who misuse of this technology we were thinking about later versions of the technology before what we were actually holding because what actually happened is we release it we observed a huge amount of positive uses and really surprising ones like 
this game AI dungeon where a language model becomes a kind of Dungeon Master and it feels like interesting and actually different like a different form of game playing something we wouldn't have expected and the misuses were relatively small and it's actually because it's really hard to make a misuse of a technology it's probably as hard to make a misuse of the technology as it is to make a positive news and luckily most people want to do the positive uses so your your amount of people doing abyss uses it's a lot smaller I think that means that the responsibility of technology developers is going to be more about maybe you're still going to kind of trickle things out in stages but you're ultimately gonna go to kind of release lots of stuff in some form it's about thinking about how you can control some elements of the technology while making other parts accessible but can you control how you'd expect a big generative system to be used while making it maximally accessible because you definitely don't want like a big generative model that may have biased tendencies providing generations to people in say a mock interview process that happens before they speak to a human for an interview stage because that's the sort of usage but we can imagine and feels like the sort of thing you really want to avoid but you can sort of imagine ways in which you've made this technology really really really broadly accessible while finally ways to carve out parts where you as a developer kind of say this this probably isn't ok so I because our thinking's become a lot more subtle and I think we did we didn't anchor on the future more than the present and that's been one of the main things that's changed interesting so knowing that you know now you wouldn't withhold the model I think you'd still do staged release but I think that you do role research earlier on characterizing of biases of the model and potential malicious users because I think what we did is we did some of this research and then we did a lot more after some of the initial models to be released on characterizing subsequent models we are planning to release what I think is now more helpful is if you you have a load of that stop front loaded so you're basically saying here's the context here of a traits of this fig which is like going to slowly be released that you should be aware of it and so yeah I think we would have done stuff slightly differently and I think that this is what we're trying to do here is is learn how to behave with these technologies and some of that is about making yourself like more culpable with is traditional for its outcomes because it's a thinking exercise it makes you think about different things to do so I'm glad but part of the goal of GPT to is bring a problem that we actually don't get to get wrong in the future earlier in time to a point where we can like do different ways of release and you know maybe some that'll be good and some will be suboptimal and learn because I think in five six seven years these sorts of capabilities will need to be treated in a sort of like standardized way we thought about carefully and getting to that requires lots of experiments now it's kind of interesting I guess there's sort of two kinds of problems again I think my understanding of the worry with GPZ 2 is actually malicious uses which like more information probably wouldn't help with but then there's also I think like you know your idea of like accidentally racist paintbrush you know like that sort of speaks to like inadvertently 
bad uses I mean both seem like potential issues but do you now view malicious uses as kind of less of an issue because I really could imagine like a very good language model having plenty of malicious uses I suppose you could say well any interesting technology probably has malicious uses so should we never release like any kind of tool like how do you think about that yeah again it's good what we're doing really easy question a couple of things one of the things we did with GPT too was we release detector system which was a model trained to detect outputs and GPT two models we also released a big data set of unsupervised generations from the model so other people could build different detector systems I think that a huge amount of deal with misuse is just giving yourself awareness you know like why why are police forces around the world and security services able to actually deal with organized crime or we can't make organized crime go away to socio economic phenomenon but they can like tool up on very specific ways to detect patterns of organized crime and I think it's similar here where you need to release tools that can help others detect the output for things you're releasing for avoiding malicious users I think it's actually kind of challenging I think that it's a little unclear today how you completely rule that stuff out I think it's generally challenging to do that with sort of technologies some of how we'd be approaching it is trying to make prototypes the idea being if we can make like a prototype new space that's malicious and real then we should sort of talk to affected people the extent to which we would publicize that remains deeply unclear to me because as you've kind of been sort of interior if you publicize malicious users it's like look over here yes how you might miss universe state we've released which seems seems a little dangerous I think that we're going to need new forms of control of technology in general at some point I don't think that's like this year's problem or next year's problem but you know in 2025 you're going to have these like embarrassingly capable cognitive services which can be made available to large numbers of people and I think sort of cloud providers and governments and other are going to need to work together to really characterize what can be just generically available for everyone and what needs some level of like care and attention paid to it and getting to that's going to be incredibly unpleasant and difficult but it feels part of a noticeable but I guess just to be concrete like if you created sailing at GPT three that was much more powerful you think that you would probably release it along with the detector would be the sort of compromise over I think you think about different ways that you can release it because like some capabilities might find some might you know you might want to have some sort of control so you control the model over people access sort of services around there that could be one way you do it another way it could be just releasing fine-tuned versions of models on specific datasets or specific areas because if you find you know model it's kind of like murals silly putty where you take this big blob of like capability you printed on you dataset it takes on some of the traits of that data set and in some sense you've restricted it so you can do things like that I mean the challenge for a lot of developers going forward is going to be in how to deal with the route like artifacts themselves like the models themselves like 
here's a thing I think about regularly is it's it's not today it's not next year it's probably not even 2022 but but definitely by like 2025 we're gonna have conditional video models like someone in the AI community or some group of people are gonna develop research but allows us to generate a model generate a video that runs for some period of time you know a few seconds probably probably not like minutes but they can guide it so it it includes specific people and they do specific things and maybe you also get audio as well that capability is obviously something but it's like a much harder case of just a language model or just an every 12 I think with that people he definitely gets like quite a few controls applied to it and needs systems for like authentication of real content on Republic Internet as well like it provokes questions about that yeah I think we're heading we're heading into a weird era for all of this stuff I just I think the advantages you get up releasing all of yourself just sort of publicly of the Internet pretty huge but I also paid for this like to some degree a dereliction of duty by the AI community to not think about the implications of where we are in three four or five years because I have high confidence that we can't be in this state of affairs where the norm is to like put everything online instantly because I think I people just develop things that are frankly like too capable by we I mean in AI research is pretty large bu to be able to do that and say this is fine do you think we're young I need to ask you what is the responsibility of sort of technologists and how do we get to a responsible place necessary and then you could ask me another question for that I don't know it's funny I feel like I really want to want to reserve the right to change my mind on some of this stuff like I feel like yeah I think I think I'm kind of reluctant to like say things publicly because I you know the it seems like actually the ethics really depend on sort of the the specifics of the you know how the technology works and stuff and I think like you know I think like on GPT too is like for just as an example it seemed like you know I thought open a decision was intriguing and like different than I think what I would have done or what my instincts would have been but it was kind of like provocative to say hey we're not gonna you know release this model and I think you know I think the good thing about it maybe it was it kind of got everyone kind of like talking and and thinking about it I guess also another thing that I don't really have a strong point of view on but it was like little interesting is it seems like every it seems like at the moment every AI researcher is sort of asked to be like their own kind of ethicist you know on this stuff like I see like a lot of like you know Ethics documents coming out with you know like even like open source you know ml projects will sort of have like their their code of conduct and on one hand it seems a little it seems a little almost like highfalutin to me like I feel like I have this instinctive like come on like you know like you know should I like put out in like a code of ethics with like you know like the toaster that I sell or you know it seems a little there's something seems a little like unappealing about it but I can actually also definitely articulate the other side of it that if you think I guess to me like it's less like the the power of an individual or and more of just like sort of like if this technology can kind of 
compound and, you know, run amok, then maybe it's the case that people really should be thinking about it. But yeah, honestly, I don't know. And I don't even know... I guess I'm curious what you think about this, because you're in this all the time. Do you think that AI researchers are in the best position to decide this stuff? I mean, if it really affects society as profoundly as you're saying, it seems like kind of everyone should get a say in how this stuff works, right? Yeah, so this is unfair, right? What's actually happening here is an unfair situation for AI researchers, which is that they are building powerful technologies, they're releasing them into a world that doesn't have any real notion of technology governance, because it hasn't really been developed yet, and they're releasing them into systems that will use the technologies to do great amounts of good and maybe a small amount of harm. And so the challenge is like, "Well, I didn't sign up for this. I wanted to do AI research. I didn't want to do AI research plus societal ethics and geopolitics. That's also not my expertise." I think that's a very reasonable point. Unfortunately, there isn't another crack team of people hiding behind some wall, entirely shouldering the burden of this. There are ethicists and social scientists and philosophers and members of the public and governments; all of them have thoughts about this, and those people should be involved. But I think the way to view AI researchers is: they're making stuff that's kind of important, and they should view themselves as being analogous to engineers, like the people who build buildings and make sure bridges don't fall over; you have a notion of ethics there. Chemists: you have a notion of ethics because chemists get trained how to make bombs, and so you kind of want your chemists to have a strong ethical compass so that most of them don't make explosives, because until you have a really, really resilient and stable society, you don't want lots of capable people to have no ethical grounding, because they might do experiments that lead to literal blow-ups. Or, you know, people like lawyers, who have codes of conduct in their space. It's very strange to look at AI research, and more broadly computer science, and see a relative lack of this when you see it in other disciplines that are as impactful, or maybe even less impactful, on our current world. And I don't think any one AI researcher is going to solve this on their own, but I think that a culture of culpability, of thinking "actually, to some extent I am a little responsible here; not a lot, it's not my entire problem, but I have some responsibility," is good, because how you get systemic change is, you know, millions of people making very small decisions in their own lives. It's not millions of people making huge, exceptional decisions, because that doesn't happen at scale, but millions of people making slight shifts is how you get massive change over time. I think that's kind of what we need here. Hi, we'd love to take a moment to tell you guys about Weights & Biases. Weights & Biases is a tool that helps you track and visualize every detail of your machine learning models. We help you debug your machine learning models in real time, collaborate easily, and advance the state of the art in machine learning. You can integrate Weights & Biases into your models with just a few lines of code. With hyperparameter sweeps, you can find the best set of hyperparameters for your models automatically. You can also track and compare how many GPU resources your models are using with one line of code. You can visualize model predictions in the form of images, videos, audio, Plotly charts, molecular data, segmentation maps, and 3D point clouds. You can save everything you need to reproduce your models days, weeks, or even months after training. Finally, with Reports, you can make your models come alive. Reports are like blog posts in which your readers can interact with your model metrics and predictions. Reports serve as a centralized repository of metrics, predictions, hyperparameters, and accompanying notes. All of this together gives you a bird's-eye view of your machine learning workflow. You can use Reports to share your model insights, keep your team on the same page, and collaborate effectively, remotely. I'll leave a link in the show notes below to help you get started. And now, let's get back to the episode. Well, let me ask you another easy question: what do you think about military applications of AI? I think that, well, the military applications of AI aren't special, in the sense that it's a technology that's going to be used kind of generically across different domains, and it'll get used in military applications. I mostly don't like it because of what I think of as the AK-47 problem. So, you know, the AK-47 was a technological innovation to make this type of rifle more repeatable, more maintainable, and easier to use by people who had much less knowledge of weaponry than many prior systems required. You developed this system, it goes everywhere, and it makes the act of taking life, of carrying out war, cheaper and more repeatable: massively cheaper and much more repeatable. And so we see a rise in conflict, and we also see that this artifact, this technical artifact, to some extent drives conflict. It doesn't create the conditions for conflict, but it gets injected into them and it worsens them, because it's cheap and it works. And I think that AI, if applied wrongly or irrationally in a military context, does a lot of this: it makes certain things cheaper, certain things more repeatable, and that seems really, really bad. I think AI for military awareness is much more of a gray area. Like, some of the ways in which an unsteady peace sort of holds in the world is by different sides that are wary of each other having lots of awareness of each other: awareness of troop movements, distributions, what you're doing. And they use surveillance technologies to do this, and I think you can make a credible argument that the advances in computer vision that we're seeing, which are being applied massively widely, may, if adopted at scale by lots of militaries at the same time (which is kind of what seems to be happening), provide some diminishment of certain types of conflict, because it means there's generally more awareness. I think stuff like the moral question of lethal autonomous weapons is really, really challenging, because we want it to be a moral question, but it's also ultimately going to be an economic question. Like, it's going to be a question that governments make decisions about on the motivation of economics, speed of decision, and what it does for strategic advantage, which means it's really hard to reason about, because neither you nor I make these decisions, and they're actually made with a radically different frame. We probably have a strong intuitive push against it existing, but that's not the frame of those people. Right, right. Let's see, oh man, what else have I got? Okay, this is maybe a less, um, a less loaded question, but I'm
actually like genuinely curious about this so you know you recently put up this paper I think it's called towards trustworthy AI development and I thought the you know as someone who builds a system that does a lot of saving of experiments and models and things like I thought is really intriguing that you picked as like the subtitle mechanisms for supporting verifiable claims so it seems like you draw this incredibly bright like direct line between you know trustworthy AI development and supporting verifiable claims and I was wondering if you can sort of tell me why that that that is so connected well it's it's really easy for us to savings that have immoral or unethical kind of value and in words committed organization to something like we we value you know the safety of our systems and we value them not making you know biased decisions or or what have you but that's an aspiration and it's very similar to a politician on the election campaign trail being like well if you elect me I will do such-and-such video I'll give you as money or are like I'll build this thing but it's not very verifiable like you're sort of needing to believe the organization or believe a politician and they can't get much proved to you because a is going to be really really significant in society that's going to play an increasingly large role people are going to approach it with slightly more skepticism just as they do with anything else of their life but plays are like large role and aspects of them and they're going to want systems of recourse systems of diagnosis systems of sort of awareness about it now today for most of this we just pull back on people we fall back on like the court system you know as a ways you'd like insure stuff very viable we have these mechanisms in the law that mean that if I as a company make a certain claim you know especially one that has a fiduciary component the the sort of validation of that plane comes for loan and stuff around my company and the ability to verify it comes from action and also legal recourse if I'm not doing it tons of stuff like that but I guess like what but just before they like you like this because some people will not have at the paper listening this silly when you say like supporting verifiable claims like what's an example of like a claim that you might want to verify that would be relevant to trust for the area development is that say our system is we feel that we've like identified sort of menu for main fire season our system and have labeled it as such however you know we we want the world's sort of validate but our system lacks lies in the critical area so we're going to use mechanism a bias bounty to get people to compete to try and find biased rates in our system and survey you've got to be you're making a claim about it I believe that it's you know relatively unbiased or I've taken steps to long for bias in it but then you're introducing an additional thing which is a sort of transparent mechanism for other people to go poke holes in your system and find bias easily and that's going to make your claim more verifiable over time and if it turns out that your system had something like huge trading cattle swatted well at least four mechanism helps you identify too many various rate from there similarly we think about the creation of like third-party auditing organizations right so basically you could have an additional step you could have I have a system making some claim about bias putting a biased bounty out there so I have more people like hitting 
my system but if I'm being deployed in a in a critical area and what I mean by critical is you know a system that makes decisions that affect someone's financial life so you know any any of these areas where policymakers really really care about then they can say okay my system will be audited by a third party when it gets used in these areas and so now like I'm really not asking you to - believe me I'm asking you to believe like the results by public County and the results of this third party auditor and I think when all of this stuff kind of stacks on itself and gives us the ability to have to have trust in systems other things might be I will just you know I will make a claim about how I value see but the mechanism by which I will be trading my models and aggregating data will be using sort of encrypted machine learning techniques so there I've got this claim but you can really verify it because I have an auditable system that shows you kinda sort of preserving your privacy while manipulating your data and so the idea of this report is basically producer loaded mechanisms but we and a bunch of other organizations that people think are quite good and then the goal over the next year or two is to have organizations who are involved in the reports and obviously weren't implement some of these mechanisms and try them out and we'll be trying to do this with oh so I can join the red team - yeah I think like so obviously having we recommend a shared red team that takes a little bit of unpacking because obviously if your two proprietary companies your red team's can't share lots and lots of information about your proprietary products but they can share the methods they use - like Red Team AI systems and making standardized on some of those sort of best practices that kind of thing feels really really useful because eventually you're going to want to make claims with your red team the system and it's going to be easier to make a trustworthy claim if you use a kind of industry standard set of techniques that are well documented that many have done but if you just sort of cowboy it and doing yourself so yeah please join the red team we want lots of people on like some shared red team infrastructure eventually but the red team infrastructure is actually it seems like the way you describe it and I'm sure this comes from security but I just I'm not super familiar the field it's like you have someone like internal to organization right like you we have an internal team that that tries to break or tries to find problems with you have that and then you're seeking to find ways to have your internal team share insights with other people at other organizations now they can't say here's of a proprietary system I broke and what I did but they can say when I like to sit down and crack my knuckles and try and like red team an analysis that here the approaches I use spective we not in red teaming but we had actually done a little bit of this and open the eye we're in a GPT to preserve people we wrote about some of the ways we try to probe the model for biases because we think that this is an area that's generally useful to especially useful to get standards on and then since then we have just been emailing our methodology to like lots of people at other companies these people can't tell us about the models from their testing providers but they can look at their like probes we're suggesting and tell us if they seem sensible and so that shows you how we're like able to develop some shared knowledge without 
without breaking sort of proprietary stuff interesting do one thing I kept kind of thinking is as as I was as I was reading your paper is like I use all kinds of technology that I don't think has made verifiable claims like I mean I feel like I rely on you know all kinds of things to work you know and maybe they're making claims but I'm certainly like not aware oh well I sort of assumed that internet security works I assume that you know I now have like all these things plug into my home network that could yeah but I just sort of what do you think that it sort of seemed like these might be just sort of best practices for developing any kind of technology or do you think there's something like really AI specific within it and where would you even like draw the line where you would sort of call something yeah that sort of needs this kind of treatment I think some of it comes down to when when you draw a line I think a I saw is basically when you cross through a technology but can easily be sort of altered and analyzed and half the scope of its behavior to find to a technology where you can somewhat orde it and analyze it and sort of list out where it'll do well but you can't fully define its scope and I think that a lot of like just sort of once you train in your on that you have this like big like probabilistic system but will mostly do certain things but it actually has a surface area that's inherently hard to you characterize fully it's very very very difficult to like fully list it out and mostly it doesn't make a huge amount of sense to you because only a sort of subset of the area of the service area of your system is actually going to be used at any one time so it does have some kind of differences or you know bias counties right is a kind of weird thing it's sort of equivalent to saying all right before we elect this like mayor or before we appoint this person to an administrative position we want a load of people to us from a ton of different questions about quite abstract values that they may or may not have because we want to feel confident that they reflect the values what we'd like someone to have in that position that feels different actually it feels a little different like normal technologies and it would be observed to expect we get to a world where everyone verifies it replay they make all the time because you have the time you know I mostly go through my life depending on on my own belief that other people are sticking to the rules of the game but we all have some cases where we want to go in on something that's happening in her life and oiled it every single facet of this and I think the way to think about why you need verifiable claims or ability to make from quite broadly is as government's consider how to sort of govern technology and how to let technology do the most good while minimizing their the harm it's probably going to come down to the ability to verify certain things in certain critical situations so you're kind of building a little bit stuff not for the majority of your life we pull the really narrow edge cases where this has to happen but necessarily that means you need to build quite general tools for verification and then try and apply it with specific areas it's interesting that why don't it seems like there's been a lot of sort of complaining about AI research recently that a lot of the just the research claims which are maybe not so loaded and not so apply to we don't interact with are actually not really verifiable yeah I mean some of these things are 
just because there is a computer gap there is like a minority of organized large amounts of compute varies a majority of organizations and a huge swath of academia if not all of academia but has very real computational limits and this means been at a really high level you can't really validate claims made by a subset of the industry because they're doing experiments at scales which you can't hope to meet so some of this is about what one of really general tools we can create just resolve some of these kind of a symmetries of information because some issues of verifiability or less about your ability to verify specific thing at that moment it's more about having enough kind of cultural understanding of where the other person is coming from that you kind of understand what they're saying of a premise behind it and can trust them which is less you demanding a certain type of verification but being like okay well you know you're a complete alien to me you come from another cultural context or another you know political ideology however we have this sort of strong shared understanding of this one thing but you're trying to get me to believe you about and right now if like certain organizations wanted to motivate academia to a certain type of research it would depend on I come from this like big compute premise land and I'm asking you to hear me when I list out a concern that only really makes sense if you've done like experimentation of my scale because that's calibrate my intuitions so we need to find a way to give these people the ability to have the same conversation so that you can sort of improve that so are you gonna give them a ton of compute like what's your participation we basically specifically recommend for Bay but governments fund cloud computing which is a bit difference - it's a bit wonky right but well one thing you need to bear in mind is that today a lot of the way academic funding works sort of centers usually on the notion of having some bit of hardware or capital equipment that you're buying and as we know like that stuff depreciates faster than cars it's like we're bye you're a researcher at an academic institution you'd be much better place to buy like a cloud computing sort of credit or system system that lets you access a variety of different clouds work generally when we go and work with government pushing this idea but they should fund some kind of credit that backs onto a bunch of different cloud systems because you don't want the government saying all right all of America is gonna run on like amazon's cloud but it's obviously like a bad idea but you can probably create a credit which backs on to the infrastructures of like final safety large cloud entities and deal--but requested concerned family and I think this is surprisingly tractable it's like some some policy ideas are relatively simple because they don't need to be any more complicated and so we're kind of lobbying the lack of a better word governments to do this I think the other things bear in mind is that lots of governments because they've invested in supercomputers really want to use supercomputers as their computer for academia and that mostly doesn't work you actually mostly need a dumber simpler form of hardware for most forms of experimentation so you're also saying to government's like I know you spent all of this money on this supercomputer and it's wonderful and yet it's great at simulation you kill our weapons whether you don't need it for Miss stop trying to use it for this like exclusively so 
that's also about some nice encounter though that's an interesting feel like we've spent untold billions on like having the winner of the top 500 list and we're in some pitched geopolitical war with China like of course you want to use this for AI and you're like yeah but look some people just want like an ATP server actually most people are fine with that so you and this big is not like easy to like multiplex and sample out to people compared to like AWS or Microsoft or interesting well we so we're a little bit running out of time and I asked you I'm curious but we always end with two questions there I'm particularly just it in your point of view on this so yeah the first one I mean and you actually you really view a lot of things going on today I I mean from your vantage point at open air and then also the newsletter that you put out so what would you say is like the topic that people don't pay enough attention to you like the the thing that like you know it's just matters so much more than people compared to how much people look at it I think the thing that no one looks at for really matters is advances in just a very niche politic computer vision which is the problem of re-identification of an object or a person that you've seen previously and what I mean is that our ability to do pedestrian reai densification now is is improving significant it's stacked on all of these image net sort of innovations in steps or more ability to do rapid like feature extraction from video feeds it's stacked on like a load of just interesting components innovations and it's creating this stream of technologies that will lead to really really cheap surveillance but eventually is deplorable on edge systems like drones or whatever by anyone and I think what we're kind of massively under estimating the effects of that capability because it's not that interesting it's not an advanced it doesn't even require like massively complex like reinforcement learning or any of these things that researchers like to spend time on it's just a sort of basic component but notice the component that supports surveillance states and authoritarianism and balance the component but can make it very easy for an otherwise sort of liberal government to slip into a form of surveillance and control that no one would really want - ha and I'm actually thinking about yeah can I write like a survey or something about this because it's not helpful for someone like open AI to warn of this it's sort of a wrong message it's may be okay for me to write about a kekkaishi of my newsletter as I do but I sort of think about writing an essay like has anyone noticed this if I gather all of the scores right I look at all of the graphs and stuff and speak by folder of it it's all going up like it's all great hockey stick it's all getting cheap yeah so that's a cheerful but I think it's important yeah well great answer as expected all right it's a good question which we always ask and normally we're we're talking to kind of more industry practitioners but maybe can apply it to open air so when you look at the demo projects that you've witnessed and like Oprah has actually had some really spectacular ones um what's the like what's the part of sort of like conception to to complete that looks the hardest here and maybe them the most unexpectedly difficult piece of it like sort of watching you know like solving dota or being there's even a donor like GP to like what like where do things get stuck and and why good question um I think there may be two parts 
where projects get like stopped all have interesting traits one is just data like they used to really want visa to not matter so much and then you just look at it and realize that you know whether it's like dota and how you ingest data from like the game engine there or robotics and how you choose to do like two main randomization in simulation or supervised learning where you're just figuring out what data sets they have and what what mixing proportion do I give them during trading and how many buttons so I do that just seems very hard I think others have talked about this it's not really a well-documented science it's something that many people treat with intuition and it just seems like an easy place to get stuck and then the other is testing once I have a system how well can I to rise it and and what sort of tests can I use from the existing research literature and what tests do I need to sort of build myself like we we spend a lot of time to figure out new evaluations at open AI because for some systems you want to do a form of eval that doesn't yet exist to characterize performance in some domain and figuring out how to test for a performance trait that may not even be present in the system is really hard so most of you two areas yeah okay I can't help myself actually as you're attacking I I find him like one more question they watch I'm sorry to do this but I've won so like I feel like the people that I know or that I've like watched closely at opening have been actually spectacularly successful and and like you know they've been part of projects that have really seems to me have succeeded like the the robot hand doing the Rubik's Cube and dota are they're like a whole bunch of products that are a project that we don't see that I've just totally failed universe that was sort of a failure we tried to like we tried to build the system which is kind of like opening I gin but the environments would be every flash of HTML game but have been published on the internet and they said yeah so that failed right that failed because of network asynchronicity and so basically you ended up having because we were sandbox and the things in the browser and you had a separate game engine with you to go and talk about the network to them all rail actually isn't really robust enough to that level of like time jitter to do useful stuff so that kind of didn't work and so we have some public failures which are because it's kind of yeah we have so some kind of private ones a lot of it is you know some people just spend a year or two on an idea but then it's not not working out some people and I won't name the project as its public but maybe they came up with a simple thing but worked really well and we spent six months trying to come up with what as a researcher before it was the board discipline more like better approach to it and the simple thing was work all of these they tried to eventually published your system with like a simple thing I know like yeah it works but I don't know you but rather let my complex idea works um but we don't like out big bets like a hand or don't or all GPT those intended to go okay and that's usually because they've come from a place of iteration like dota came from prior work applying PBO and I think evolutionary algorithms to two other systems the hand came from prior work on just like block rotation right so once you can do block rotation you can do a remix TPT came from prior work on scaling up language models just of GPT one so a lot of it's just happened sort of 
iteratively of a public generic but yet we don't have like we don't have an abnormal lack of failure nor an abnormal amount of success like because it's pretty pretty in distribution I know okay yeah yeah thanks all right [Music],9630 +Rachael Tatman — Conversational AI and Linguistics,https://www.youtube.com/watch?v=n_CTGZSq4m0,2211,2020-04-06,hi I'm Lucas and you're listening to gradient descent we started this program because we're super passionate about making machine learning work in the real world by any means necessary and one of the things that we discovered in the process of building machine learning tools is that our users and our customers they have a lot of information in their heads that's not publicly available and a lot of people that we talk to ask us what other people are talking about other people are doing what are the best practices and so we want to make all the interesting conversations that we're having and all the interesting things that we're learning available for everyone out there so I hope you enjoy this today our guest is Rachel Chapman who is a linguist and developer advocate for Raza who helps developers build and deploy conversational applications using an open-source framework before that she was a data scientist at kaggle where she was also a cago Grandmaster she also did really interesting work at the University of Washington as part of her PhD in linguistics yeah thank you for having me oh my pleasure the place where I was hoping to start if it's all right is kind of your experience at Cal just because I think that's such an interesting website with like so many just like so many interesting learnings about machine learning I feel like a lot of like kind of new stuff it kind of happens on cago first sometimes and the the insights that folks have are so interesting could you just maybe tell us a little bit about what your experience that at coggle was like and kind of what you learned yeah so I was at Cagle for two and a half years and I think most people who are familiar with Cagle familiar with the competition's which are generally supervised machine learning competitions where everyone's working on the same problem with the same dataset and I never actually supported competitions directly so I worked on the dataset hosting platform and I worked on the host to Jupiter notebook environment that cago develops which is called notebook skagle notebooks at this point it was called kernels for a while because you can also have scripts and also the forums for developers to talk with each other and the learned content I also worked on a bit as well so learning is kegels um sort of machine learning courses they're becoming more sort of structured and fully featured over time so I worked on all those parts of the website I worked a lot with the community developing educational content making product recommendations so one of the things that I'm I'm most proud of is I mentioned we had scripts that were sort of flat hosted Python or our files or our markdown and you also had notebooks but for a while if you had a module you're working on a script there was no way to use that module in a notebook so I worked with the engineering team to sort of spec out what it would look like to have importable scripts and now you can do that it was built out and was I think pretty successful so it was coggle your first kind of industry job coming out of a PhD it was so I started right after I graduated and I was actually the time I was still applying for faculty positions as well 
so I had was in the sort of limbo or I was working a kegel and I was also the faculty job market for academic positions and I found that I really enjoyed working in industry and the things that I would have liked about a faculty position so the teaching and helping people build cool things my preference would have been more language he thinks which is that I'm at Rosslyn now I had a lot more reach and impact ik haggle than I would have had in a university setting so I found that really appealing were there any kind of um things that you have to get used to like kind of going into an industry job out of academia like I feel like that's a hard adjustment for some people but it sound like you you really enjoyed it I did I think one of the biggest changes for me I know a lot of people talk about sort of the pace of work that the pace of academia is much much slower and if you've ever been in industry or academia trying to collaborate across across those fields it can be a little challenging but for me I think a bigger change was that the North Star of what success looked like would really change very dramatically so I kind of was a startup that was acquired by Google and I had still sort of had and still sort of a start-up mentality of iterating fairly quickly so when I first started my focus was really on trying to increase the number of data sets on the dataset platform and we found eventually that was worth happening or and then my focus changed to helping people write write notebooks goggle up books using the hosted platform and then to just sort of growing the community and helping them grow their skills whereas in academia you know you need to publish the top-tier conferences and journals I guess show up for the class you're teaching they would prefer that it's not the effort isn't really rewarded for most most tenured tech jobs and you know that in order to continue to advance you need to have the highest number of high-quality publications possible basically so that North Star of what success will look like for you in academia is very very static and in industry at least in my experience it has been much more variable - I wouldn't call it a North Star I'd call it like a common north planet so that like a frustrating thing or were there good parts of that yeah I was gonna say I like change I don't think that that's like a categorically Joe um but I enjoyed the challenge of things changing relatively quickly and it made things feel very fresh and attainable to me and also the goal posts are much much closer so you know I can be working on a single research paper for years and years and years and years whereas projects that's a long-term project in an industry setting would be like a quarter that would be like a big ask at least again in my experience it might be different in different companies that's what eyes what were the kind of goals of Congo's it's to sort of increase the engagement of the users then like you wanted more datasets but like what do you think like are sort of the big goals they're like I know goggle really helps a lot of people kind of get into machine learning and they've made I mean so many kind of open datasets the kernel stuff is so cool with the collaboration like how do you think about that or what was the what were the big kind of goals I guess yeah I think the the big goal of Cagle is to really help all data scientists with their work and I mean I don't I'm not the company anymore so don't really know there's the new term goals at the moment but sort of all of 
the different things that Kyle is doing are in service of that you know higher goals to help people basically get better at job and then do their job successfully so that definitely Clues like people who are brand new to the field sort of sort of starting to try it out and get their first steps and also people who are really advanced in the field and want a challenge he certainly has most professionally to scientists machine learning engineers now you don't spend most your time building model that's a fairly small slice of the data science workflow but I think it is for many people with true data science and machine learning in the first place so having a place where you can go and just like you know just have the part of the task that you really like doing in a very challenging way I think is really appealing to people you know practically XD boots will look some sort of radio boosted model will work for most things it's fast it's cheap to run like probably that's what you're gonna be doing day to day and you don't really need to you know get much fancier so having a place to let go and cut loose as well cool you've been at Reza for about a month you said yeah I guess coming up on two months yeah cool I don't know what it what is resident so Rosa is a startup that has an open-source conversationally framework so basically to take in text in a conversation figure out the information that is in that text decide what to do next and then take that action whether that's a turn whether that's running some code and then on top of that open-source framework there is a free platform called Raza X and Raza X lets you deploy models you know have have people test your your models annotate data and sort of fold them back in so you have a little bit more of a human in the loop learning process well you're iterating and then if you are a business that wanted to use these tools we have also enterprise which has lots of additional features and this is focused around kind of conversational yes so chat BOTS virtual assistants anything where you would be interacting with an automated system through a text conversation or voice conversation rather than through a GUI or a command line what makes you excited about conversation like what's the heard of the promise of conversationally so I think we've all had probably bad experiences with chatbots in the past there was definitely a period last couple years and people were very excited to try the technology and I think the sort of industry-wide design expertise wasn't there yet so there were lots of I don't know I had frustrating experiences I think there was a study that like 54% of people have had just like a bad chat experience but as design has really matured I think it opens up being able to do computational tasks to a much wider variety of people so sorry we're computational tasks so anything we need to use a computer oh I see so as an example people who aren't literate have sort of limited ability to use gooeys and sort of have to memorize where things are but probably have had conversations in the past and can especially with police technology really interact very naturally with whatever computational system they're interacting with in even just being a computer literacy using a mouse is not I mean if you're a technologist it's second nature to you but it's a learned skill it takes a while to to acquire so being able to provide services and open up you know access to people with a variety of different abilities and backgrounds I think is to me the most 
appealing part of conversational AI. But it also comes with a second challenge, which is that people who use a conversational system come in knowing how conversations work, and they will always judge a conversational system against a human, because that's the standard: if I were having this conversation with a bot, my mind would be blown. I don't think we're quite there yet, so being able to achieve that really fluent level of conversational interaction is a really large engineering challenge and a really large machine learning challenge. Lukas: Do you have an example today of a conversational system you can actually interact with that's a really good experience, something you would point to as how things should work? Rachael: I think the most recent one I had that was really fantastic, and I think it's publicly available, was for booking time off. I knew the days I wanted for my vacation, I knew that if I was going to go through the website I'd have to do 80 different things, and I wasn't really sure how the process worked because I'd never done it before. A co-worker of mine said, hey, use this bot, and it was a really fantastic experience. It was really well designed: in turns where there were very few possible options, instead of having me generate text it gave me buttons, and using buttons in the conversational flow made it feel much faster. There was a variety of different things that needed to happen, it kept track of the things I'd said before, like the dates and the things I needed, and at the end it took maybe two minutes to do what otherwise would have taken me a good half hour. So in general, I think a good conversational interaction is one where it is faster to do the thing you need to do than it would be otherwise. Lukas: And do you think things are getting to that point now? It sounds like your experience was that it was a lot faster to go through the conversational interface. Rachael: Yeah, I think so. It's a young field, and we as conversational AI people are definitely still learning what makes these systems work very well and be very delightful, but I think we're getting there. Lukas: I have this feeling that NLP has made some huge improvements in the last year or two. Are these things already deployed in conversational agents, or is there more work to do to make them production ready? Rachael: At Rasa we have recently added an option for BERT embeddings, which I think is probably one of the things you're thinking of, so transformer architectures are definitely there, and we use them. Of course we're open source, so if you wanted to use something else you'd be welcome to, but we use contextual embeddings, specifically conversationally trained contextual embeddings, and that paper was a year or two ago, so fairly recent.
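To make the embedding idea concrete, here is a minimal, hypothetical sketch of intent classification on top of pretrained sentence embeddings. It is not Rasa's actual pipeline; the `sentence-transformers` package and the model name are stand-ins for whatever conversationally trained encoder you would really use.

```python
# Hypothetical sketch: intent classification on top of pretrained sentence embeddings.
# Illustrative only; not Rasa's implementation.
from sentence_transformers import SentenceTransformer  # assumed to be installed
from sklearn.linear_model import LogisticRegression

# A handful of labeled example utterances per intent.
examples = [
    ("I want to rent a car", "rent_car"),
    ("I need a vehicle for the weekend", "rent_car"),
    ("where is the nearest dealership", "find_dealership"),
    ("how do I get to your showroom", "find_dealership"),
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for a conversational encoder
texts, labels = zip(*examples)
X = encoder.encode(list(texts))  # one dense vector per utterance

clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Because "car" and "vehicle" land near each other in embedding space,
# a paraphrase the model never saw can still be classified correctly.
query = encoder.encode(["I'd like to book a vehicle"])
print(clf.predict(query)[0], clf.predict_proba(query).max())
```

The same fuzziness is what lets a handful of examples per intent go a long way, which comes up again below.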
Rachael: A lot of the more recent work that has been a little more headline-grabbing has been around natural language generation, so the GPT-2 stuff, and Meena, which is a Google project that came out relatively recently. I think natural language generation is much trickier to get right. The default setup, certainly for Rasa, is that you have a limited number of utterances the bot can say, and you might have some slot filling, so you might say, 'Hello Lukas, I see that you recently went to Vienna. Do you want to rate your hotel?', or whatever your interaction is. The tricky thing about a lot of the natural language generation work is that it sounds very fluent. It sounds like something a human could conceivably say, which is very exciting and no small feat, but it's not grounded. Take the GPT-2 text examples: there's one that's like, oh, these scientists discovered unicorns. Scientists have not discovered unicorns. It doesn't have ties to any knowledge base that is the ground truth the text is being generated around. My worry is that people without a deeper understanding of NLP will see these very fluent text productions and think, oh, I don't have to build a bot, I can just feed the user input into this predictor that will come up with some text that I should say back. There's nothing to stop it from being completely unfactual, like, yes, of course we'll give you a full refund. And there's nothing to stop it from being abusive: for most of the large language models there are certain adversarial attacks you can use, small text strings of maybe 10 to 12 characters, that will cause them to produce really vile, abusive output, and that's obviously not something you want to be showing to people. So I would not be comfortable doing completely neural natural language generation in a conversational assistant. I definitely would want to have more control over the utterances.
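As an illustration of what more control over the utterances can look like in practice, here is a small, hypothetical sketch of templated responses with slot filling. The template names and slot values are made up for the example; this is not Rasa's actual response-selection code.

```python
# Hypothetical sketch: grounded, templated responses with slot filling,
# as opposed to free-form neural text generation. Not Rasa's implementation.
RESPONSE_TEMPLATES = {
    "utter_greet": "Hello {name}! I see that you recently went to {city}.",
    "utter_ask_rating": "Do you want to rate your hotel in {city}?",
    "utter_fallback": "Sorry, I'm not sure I understood. Could you rephrase that?",
}

def respond(template_name: str, slots: dict) -> str:
    """Fill a fixed template with tracked slot values.

    Every possible utterance is written by a human ahead of time, so the
    assistant can never hallucinate a refund policy or produce abusive text.
    """
    template = RESPONSE_TEMPLATES.get(template_name, RESPONSE_TEMPLATES["utter_fallback"])
    try:
        return template.format(**slots)
    except KeyError:
        # A required slot has not been filled yet; fall back rather than guess.
        return RESPONSE_TEMPLATES["utter_fallback"]

print(respond("utter_greet", {"name": "Lukas", "city": "Vienna"}))
print(respond("utter_ask_rating", {"city": "Vienna"}))
```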
Lukas: So the way Rasa works is it figures out an intent, is that right, and then it sort of fills in slots? Rachael: Yeah, the current approach is to do entity extraction and then intent identification, so intents where you have a set number of intents that you've provided training data for. Going forward, we're combining intents and entities together, so instead of having intent classification as a single part of the pipeline, it's a little more tied into the rest of the NLU that's going on, which is research we're working on. Lukas: Can you talk about what you think is hard about making a chatbot work? What are the core technical challenges to build one and deploy it? Rachael: The first hurdle is the one you'll get with any machine learning project, which is getting the data. There's an older-school approach, which is to build a state machine where you have, okay, the person said hi, we're going to say hi; the person wants to know how to rent a car, or the person wants to know how to find the dealership, so we're going to go down that path; okay, they want to rent a car, what type of car do they want? So it's like a decision tree, but for a dialogue agent. One of the big challenges is that if someone says, okay, my car... actually, where's the dealership?, being able to recover from someone interleaving other types of intents into the happy path you've constructed is fairly challenging for that sort of state-machine-based approach. So the way we approach it at Rasa, instead of having a single straight path through the tree, is that you start with sample stories, dialogues that someone might have, and then when it comes time to pick the next turn: if you have an exact example of the specific conversational flow that you've seen before, you just continue on that flow, because you've seen it before, you're sure. If you aren't sure what the intent or the entities or any other required information is, then you have a fallback, like, could you rephrase that, or, I'm not entirely sure what you're asking for, but here are the two closest results, that sort of thing. And then there's also a machine-learning-based policy that ranks the possible responses and says, okay, I think this one is probably the correct next one. Those are all considered, and if there's one that's highly likely, that's usually the one you go with, and if there isn't, you go back to the fallback policy. So it can handle these interleaved, aside-type structures in conversations in a way that the more rigid state machine cannot.
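A rough way to picture how those policies get combined is sketched below. This is a toy illustration of the general idea, with invented story names, scores, and thresholds; it is not Rasa's actual policy ensemble.

```python
# Toy sketch of a dialogue policy ensemble: an exact-match "memoization" policy,
# a learned policy that scores candidate responses, and a fallback.
# Names, scores, and thresholds are invented for illustration.
from typing import List, Tuple

SEEN_STORIES = {
    ("greet", "ask_available_cars"): "utter_list_cars",          # flows seen verbatim in training stories
    ("greet", "ask_dealership_location"): "utter_dealership_address",
}

def memoization_policy(history: Tuple[str, ...]) -> Tuple[str, float]:
    # If we've seen this exact conversational flow before, act with full confidence.
    action = SEEN_STORIES.get(history)
    return (action, 1.0) if action else (None, 0.0)

def ml_policy(history: Tuple[str, ...]) -> Tuple[str, float]:
    # Stand-in for a trained multi-class classifier over possible next responses.
    # In reality this would be learned from example stories.
    if history and history[-1] == "ask_if_bot":
        return "utter_i_am_a_bot", 0.84
    return "utter_list_cars", 0.35

def choose_next_action(history: Tuple[str, ...], threshold: float = 0.7) -> str:
    candidates: List[Tuple[str, float]] = [memoization_policy(history), ml_policy(history)]
    action, confidence = max(candidates, key=lambda c: c[1])
    # If nothing is confident enough, ask the user to rephrase instead of guessing.
    return action if action and confidence >= threshold else "utter_please_rephrase"

print(choose_next_action(("greet", "ask_available_cars")))   # exact match -> memoized action
print(choose_next_action(("greet", "ask_if_bot")))           # handled by the learned policy
print(choose_next_action(("greet", "mumble_something")))     # low confidence -> fallback
```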
Lukas: Wait, so let me repeat this back and see if I understand it. It sounds like the first thing is essentially a state machine, a kind of rule-based system, is that right? Rachael: That's not what underlies Rasa, but it's a very common approach to building conversational assistants. Lukas: Oh sorry, so what's the first thing Rasa does? Rachael: We have a variety of policies that are all considered together, and the one that has the highest confidence is usually the one that's selected. Lukas: And what is a policy? Rachael: It's more probabilistic. You can think of it as multi-class classification across all of the possible responses, and it selects the one that's most likely based on the training data. Lukas: I see. And the training data would be an utterance, or a conversation, and the intent? Rachael: There are two types of training data. One is example intents and example entities, so things that the user would say; these you might just sit there and come up with, or you might collect them from real user questions you've already got, and that's more on the NLU side. On the other side, to determine the dialogue policy, what gets said next, you have examples of conversations. So you'll probably have the ideal one: the person says hi, they want to know what cars are available, I'm going to query the database, get the available cars, and tell them the available cars. And then you might have other possible stories, like someone says, hey, are you a bot, and you say, yeah, I'm a bot, and they say, I want to talk to a human, or whatever, and then it helps them out with the thing they need. Those stories and those example utterances together are used to train the model. And you don't need that much: if you're using a language for which we have pretrained embeddings, you don't need that much data to get started. The idea is that you build a minimally viable assistant and then you deploy it, you have people have test conversations with it, you go back and annotate those, you put them back into the training data, you retrain, and you continue on in that way. You can add additional stories, you can add new intents, you can add new utterances, and you can change your model to fit the actual conversations that you see. Lukas: I see. How much training data do you think you need before you get something reasonable? Rachael: For some of the examples I've worked on, you'll probably need 10 to 20 examples per intent, and then maybe three or four stories. Again, that's because we're using pretrained embeddings, so you know that 'I want a car' and 'I want a vehicle' are going to be similar, because the embeddings for car and vehicle are similar, so you have that fuzziness of machine learning. Lukas: Maybe shifting gears a little bit: I know that you write a lot about papers that you've read, and a really common question I get is, how do I approach papers, how do I find what papers to read, how do I even go about reading a paper? I would think you'd have some smart advice on that. Rachael: Yeah. Don't go to arXiv every day, you'll just make yourself upset. I don't try to stay on top of things right after they're written. The way I come to know there's a paper I want to read is that I'm very active on Twitter, as you mentioned, and if a lot of people are talking about a specific paper, whether they like it or they're not a fan of it, once enough people that I trust say, hey, this is an important paper for one reason or another, I'll go read it. My usual approach to reading it: first of all, if it's a really seminal paper, like the Transformer paper, oftentimes there will be a blog post or a talk that someone has done, and you can read or watch that instead of the paper and get the same information. If there isn't, I start by reading the abstract, and then I like to read the introduction and then the conclusion, so I have a good general idea of what's going on with the paper. After that, starting from the top: in the related work or literature review section, which is sometimes at the end, I wouldn't go and chase down all of the terms that might be new to you right away, just skim that section. Go to the methods, and if you see terms there that are repeated, that you saw previously, and they look like they're going to be used a lot in the paper, then go and look those up if you're not familiar with them. When you get to the math, to an equation, my strategy is always to try to put it into human words, the way I would say it, and in the process of doing that I'm usually like, oh, I don't actually know what this term is, can I figure it out from other places in the text? For me that's the part that takes the longest in reading a paper, which I think is true for most people, unless it's very similar to a field you already work in and you're very familiar with the bones of the equation. From there, continuing through, I always skim the results, because it's usually, look, here are our results, here are these other people's results, we got state of the art, unless there's something very specific that I'm interested in. Then I pay more attention to the ablation results, if they have any ablations. Ablations are when you have a full model and you start taking parts out of it and seeing what changes what. I find those to be particularly useful for a practitioner: if you're thinking, maybe I want to implement this, but that's a lot of layers, maybe I want a couple fewer layers, then figuring out what you can get rid of, what may be practical in an academic setting but not in a production setting, is really helpful. And if the paper is on OpenReview, I know ICLR is, there's a subset of conferences where the reviews are made public, so it might be helpful for you to go back and
also read the reviews of the paper if it's again something you like but this introduction what are other people saying so the more you care about the paper the further down that list you will go the more you're like I just sort of kind of wonder what's going on the nearer the top of the list I will start that's it what's the paper that you've done like pretty deep fun recently I am like probably 50% of the way through the list for the convert paper which I mentioned earlier so that is a paper that we have implemented into Raza and that is Henderson at all 20 it's a transformer embedding architecture specifically for conversational data which is obviously very relevant to us cool I was looking at your your papers from grad school and I saw you had papers I'm kind of like Twitter and things like that and I remember I remember being super interested in that and earlier Micra I'm kind of curious you have any like favor or paper any a favorite result that you know that's talked about Oh from my work so I think the the result that I would most like to let people know about so I got a lot of sort of traction with my paper that was looking at automatic speech recognition and accuracy across different demographic groups and I two papers one at ACL and then one at inter speech no es yell isn't NLP conference injury speech is a speech conference and the ACL paper was on gender and region and I found differences for both of them but that was using user uploaded YouTube videos and again I was guessing at gender and not particularly I would say ecological valid way so you know the not the absolute best methodology but when I repeated the experiment with additional api's and using higher quality audio where the the signal noise ratio I'd be controlled so basically recorded in a quiet environment high-quality microphone I found that the gender differences disappeared and this for this one I did have self-reported gender from users the demographic regional differences were still there and this time I also had access to race ethnicity data and I found a really strong difference so when signal to noise ratio is controlled you don't have or at least I didn't find that the the gender difference obtained but there was a really strong difference in accuracy from people of different geographic regions so the sort of general American prestige educated upper middle class dialect had the best recognition rates any other regional dialect had lower recognition rates it's own dialect it's interesting because in England there is like a very specific pronunciation set of rules that are considered the standard received pronunciation in RP and anything else is considered a regional dialect in United States you can have quite a bit of variation in pronunciation and still be considered in general American speaker and it seems to be more around lexical items and grammatical features that make you sound not accented everybody has an accent so I would say it's a variety I would say it's less internally consistent of a dialects than a lot of other dialects like um California English for example sounds like you basically found that the I mean I guess I guess this makes sense that the the quality got worse with any variation from kind of this standard or what would you call it the the general american standard american english will you'll hear those terms a lot so sort of the variety where speakers are consciously avoiding using regional forms so both in region but also african american speakers had much higher aires and that's 
not due to african american english being any less internally consistent or easy to recognize it's almost certainly due to imbalances in the training data so your your classic imbalance class problem right right until did you have recommendations on how to to deal with this or yeah so the gender thing is real there is a gender difference but it's more on the signal processing side and less on the automatic speech recognition sort of machine learning modeling side man on the signal processing said yes so a couple of things so one is that women in general tend to be slightly smaller tend to have slightly lower lung capacity tend to be slightly quieter so for an equal disabled level of noise you'll tend to have a little bit less signal also when we were developing sort of telephony and recording in general the the band that was picked to be sort of the target band for the the frequency band that was picked to be the target band for all systems basically and that a lot of the the speech recognition comes directly from you know Bell Labs and loud so that the telephone work in earlier days was picked to suit a male voice and not any of the other types of voices you might encounter so children also tend to have really high recognition rates partially that's due to children varying more as they're learning the language but partially that's just due to their frequency range not being represented as well but I guess downstream then it actually the the area is is higher mmm so the definitely the regional and racial differences are due to things that you could fix with machine learning whereas I think the other differences are not to use that how were you able to pull that apart I believe I had fairly balanced classes specifically on the modeling side I used mixed effects models so you can control for some feature as well identifying the effects of others what do you what do you think is an underrated aspect of machine learning that you think people should pay more attention to data visualization oh cool I've seen a lot of really excellent machine learning engineers who have a hard time communicating their results in models because they're their charts are just unreadable is anything come to mind where you like so I'm really good like something you want to call out is like an excellent so if you want to see like some master classes and visualization the pudding which is oh yeah actually a journalist thing I guess there's really really stunning visualizations that just sort of push the limit of art so data visualization one of my biggest pet peeves is you should not use lines to connect points unless there is a logical reason to do it like the time series or like that something could exist in the space between the two points don't do it for categories drives me nuts up the wall one question the how do you feel about 3d visualizations are you in a are are you presenting them in a way that people can like walk around them like I'm not against them in general I just find them harder to part it apart I mean there's there's a whole field of study that specifically looks at how humans process information and what is most useful for conveying different types of information visually please please read any visualization papers y'all what do you think is the biggest challenge for making machine learning work in the real world right now it's not really a technical challenge but I think the biggest thing that trips people up is deciding to build things that don't need to or shouldn't exist I understand it is a 
very exciting time to be in machine learning, we all want to work on fun projects and change the world, but particularly if you are building anything that deals with a sensitive or vulnerable community, I would highly encourage you to reach out to people from that community and work with them, and make sure that it's something that does need to exist and that you're building it in such a way that it's actually useful. An example that comes to mind: there have been a number of projects built by people who are not deaf and who are not signers to help deaf people communicate, and usually they take the form of gloves or computer vision to take sign language and turn it into a different language. That's not usually the problem in deaf-and-speaking communication situations; usually it's that the speaking person does not have a very good way of communicating their intent. In general, deaf people are masterful communicators and do not struggle with getting themselves understood. So that's just an example of: I get it, I understand that it's an exciting project and you're very passionate about it, but before you spend a lot of time and money and resources building something, make sure people want it. Lukas: Good advice. [Laughter] Cool, I think that might be a nice place to wrap up. Rachael, thanks so much for your time, I really appreciate it. Rachael: Yeah, thank you. That was such a good conversation, thanks Lukas and Rachael. I'm going to drop Rachael's Twitter account and some other links in the show notes below; I highly recommend that you check her out, she does these really cool live streams on NLP that are really worth watching. Also, before we go, I'd love to tell you about how Weights & Biases can help you get to the Kaggle leaderboard faster. Weights & Biases is an experiment tracking and hyperparameter optimization platform. What we let you do is track the performance of your models in real time, so you can try different experiments, different model architectures, and different hyperparameter values, and see how your models are doing as they train. We also let you log the outputs, the predictions of your models, so if you're working with images, video, audio, or, as you can see here, protein structures, you can actually see how your model is performing at every epoch and debug it really easily. We also let you see how your model is using its resources, so you can see GPU memory usage and that kind of thing, and you can compare, for instance, the effect of different batch sizes on GPU usage. You can also run hyperparameter sweeps very easily: you can pass a dictionary with a range of different hyperparameter values that you'd like us to try, and we'll automatically run all of those different models for you and show you which of those hyperparameter values did the best. We also build this kind of parameter importance plot, where we show you which of the hyperparameters were the most important and how their values were correlated with the metric you care about. Finally, integrating Weights & Biases with your models is very simple: you just import weights and biases, you initialize a project, and then you can start logging any metrics that you want.
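For readers who want to see what that integration looks like, here is a minimal sketch of the pattern described in the ad read; the project name, hyperparameters, and the fake training loop are placeholders, not code from the episode.

```python
# Minimal sketch of the Weights & Biases workflow described above.
# Project name, hyperparameters, and the fake training loop are placeholders.
import random
import wandb

def train():
    run = wandb.init(project="gradient-dissent-demo", config={"lr": 0.01, "batch_size": 32})
    for epoch in range(10):
        # Stand-in for a real training step; log whatever metrics you care about.
        loss = 1.0 / (epoch + 1) + random.random() * 0.01
        wandb.log({"epoch": epoch, "loss": loss})
    run.finish()

# A hyperparameter sweep is just a dictionary of values to try.
sweep_config = {
    "method": "grid",
    "metric": {"name": "loss", "goal": "minimize"},
    "parameters": {"lr": {"values": [0.1, 0.01, 0.001]}, "batch_size": {"values": [16, 32]}},
}

if __name__ == "__main__":
    sweep_id = wandb.sweep(sweep_config, project="gradient-dissent-demo")
    wandb.agent(sweep_id, function=train, count=6)
```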
We also have integrations for Keras, scikit-learn, LightGBM, XGBoost, and many other kinds of models, and if you want to get started really quickly, follow the link down below and you can get started with Weights & Biases in five minutes. That's all I have for you today. We'll see you next week with another great episode,6563
+Nicolas Koumchatzky — Machine Learning in Production for Self-Driving Cars,https://www.youtube.com/watch?v=NbiG8ZuRsqU,2696,2020-03-20,Lukas: Hi, I'm Lukas and you're listening to Gradient Dissent. We started this show because we're super passionate about making machine learning work in the real world by any means necessary, and one of the things we discovered in the process of building machine learning tools is that our users and our customers have a lot of information in their heads that's not publicly available. A lot of the people we talk to ask us what other people are talking about, what other people are doing, and what the best practices are, so we want to make all the interesting conversations we're having, and all the interesting things we're learning, available to everyone out there. I hope you enjoy it. Today our guest is Nicolas Koumchatzky, who is currently a director of AI infrastructure at NVIDIA. Before that he ran Twitter Cortex and was one of the first people to put deep learning models into production at scale. So Nicolas, thanks so much for taking the time to talk with us. You're an expert on deploying deep learning in the real world, and I would love to hear how things have changed since you've been doing it. I think you started doing this in 2016, or maybe even earlier, at Twitter. What were the challenges then, and what are the challenges now that you're seeing in making these models actually work? Nicolas: Thank you, Lukas. I started learning about deep learning in 2014, by the way, so I'm not one of the old school, but I got hooked pretty quickly. Then I started a small startup with five people or so, and we were acquired by Twitter. At Twitter we started the first deep learning team there; basically Twitter didn't have much deep learning knowledge at the time, and so we were paired with software engineers there to productionize deep learning in some product areas that could benefit from it. Lukas: What were the first areas where they felt like they could get a benefit? Was it vision stuff? Nicolas: Kind of a mix, mostly vision and text. We started with two main projects. One of them was filtering bad image content, so that was one, and the other one, which was more of a product feature, was deciding whether a user profile was safe to place ads on or not. This is a big deal for advertisers, because they want to make sure that if they put ads on profiles, those profiles are not toxic or insulting or any kind of account that you don't want your ad next to. We were able to classify those profiles using the text, the images, and user features as well, which allowed us to put ads on more profiles, and as you can imagine this was generating quite a bit of revenue. So that was the beginning. Lukas: And was this shifting from an existing, more traditional model to a deep learning model, or a new deployment for a new problem? Nicolas: No, in those cases there had been interest in those areas, but without deep learning it was almost impossible to perform at the required accuracy. For example, advertisers expect something like 99.9% accuracy, and that was unachievable
just using tabular features and and like decision trees I mean I think it would be doable if one put the effort but much more complex and I guess like these sound like applications that you could do is sort of a batch process in the background like it doesn't need to run live on user queries or does it that's true doesn't doesn't doesn't have to you're right except potentially for for featuring of images so whenever a user of post an image make sure that those images are kind of like hidden right away for example you know for certain categories of user one thing that required a lot of a real-time processing was sketching kind of like very bad images I'm not going to go into details but and we wanted to redo that in real time before they hit the platform basically so in that case we would have a budget of like a hundred milliseconds maybe off even less than that right to be able to to get them it's like 20:16 so happy like how did you yeah get that working in real time in production like what right so that was those yeah interesting I'm sure you know the details of like typical in frame ones like the back then but basically there was a an O and blue a dot and so we were using okay Oh a cafe that's true cafe yeah yeah forget about caffeine we won't use in cafe although it would have it would have been a pretty good solution that case what you think you were talk and so what we did from weather training it's great however for deploying to production it's kind of a more difficult to what we had and and and wrapped it up into scalar services yeah that was that was so much a father so much have fought to make sure he was walking stable and so on so forth we run torch then in production or did you like a silent yeah yeah we were running across the production yeah Wow yeah that was a lot of a fault right so require so much engineering and you have 50 milliseconds to make a decision that sounds like a real real feat were you doing that or were you kind of handing it over to a different team no we were we were doing it internally right no no we we had to because it requires a lot of required a lot of expertise but that's right and what could make them go faster or slower the batch size all those things right so I said to do everything in the house like retraining live to or how did that work so we did that but later so that was the first part where we use deboning only for for images and text mostly right we get at Massif content for example and so on so fast but then after a little while we started looking at other problems that were more fundamental Supriya like ABS basement haha time is ranking things like that that are more tabular based right so like using user features item features and trying to make the best prediction of whether user is going to engage with some content and we also managed we use deep learning for that too and we managed to get better results than traditional techniques basically and so and so the reason why I started with all of that it because in that case for example for a displacement it's very important to have access to the latest and greatest features also to do online learning what we call online learning which is yeah like learning continuously or like with very high frequency because otherwise there is a quite like quick decay of performance right the decayed health life I don't know maybe something like one to three minutes right something like that and in the middle starts to decay yeah so we had to do online learning for that yes Wow see you would retrain every 
every minute so we we so there are multiple ways of doing it one way to do it is just to do online learning right away so just keep training yeah online you could you could also like freeze some of the layers and only retrain the last logistic regression for example ah but is this is the easiest one actually we can learn some kind of like why you know I know if you're familiar with wide and deep architecture for example huh yeah but yeah you could you could you could use the like memorization that I which is usually wear the DK happens and keep retraining that one basically and keep the other one a constant or I think some companies do that I think Google does that for exam I'm not so sure but they retrain regularly like every five minutes maybe so they have they take the existing model in prod fine-tune it really ploy function if Reaper yeah you ever go back and like sort of retrain it from scratch or is it always just sort of like online yeah we do mostly if we want to add new features already can you change the middle architecture right yeah how do you even um I guess how do you even evaluate then if like a new architecture is gonna be better like it seems like that would be kind of tricky right like yeah so we have to simulate the fact that it's online learning basically until in that case we there's to be like a time period where we say okay we stopped training and we look at everything that's after and then we we can we can evaluate by keeping learning you know it's possible to simulate the situation basically right yeah it's it's more infrastructure okay yeah Wow and so I mean this must have been an incredibly high amount of compute it's yeah pretty high pretty high back then yeah yeah I mean all interestingly old CPU dough I'll see you yeah Oh CPU because because GPUs well not I mean I work at Nvidia now but they were not that that easy to use in that context there was less tooling now it's changing with I know if you heard about rapid for example uh or so basically data science accidental GPU a lot of libraries available now but back then there was none of that so we had to just we accidentally code on sleepy you but you make it really really fast yeah with there other pieces of interest infrastructure that you had to build to get this working and in 2016 oh yeah 2016 well so you mean besides the inference ali actually yeah yeah I think for we had to father for the training part one of the challenges we had is that a lot of the deliver gas terminals were used to decision trees and certain API and configurations they were not familiar always do a touch few people are familiar with Lua in general so we cannot hide had to hide this and they're like you know configurations so we build infrastructure to basically simplify their life such that we could copy paste configurations right and just specify their features be seriously like ok these are the features I have this table like the steps I wanna run you know like training validation whatever and then and then basically uh yeah basically like automatically saved a mother and so so we brought a lot of like automation in the training phase at the cursor flexibility at the beginning then it changed but yeah and how did it change so then then once the company started realizing diya impacting the importance of this at Twitter yeah they did I mean we like they decided to also hire people who could understand and really invest in education that was that was one and at the same time we decided to I guess the centralized machine learning 
platforms in basically to move to tensorflow white and saw flow because bags and fighters are still very kind of like new and unstable not even 1.0 i think and so moving to tensorflow which also had like an inference story right which doesn't have and so on so fast that's what are so many recommender systems using terms of flow those days right because they have like this compute story anyway so we move to we move to tell pretty quickly after that for training and interest yeah I see and or training it is not enough reference known part of it just the library part you know just the the C++ library part but the thing is Twitter has their own Bella formats and sterilization formats so they had to play with that so for example thrift instead of a instead of any of them I guess are there any other sort of like surprisingly challenging things at that time like stuff that like no maybe like academics or people that don't work in these sort of large-scale deployments wouldn't know about like any other like tricky pieces so so you made it from from from from Twitter people yeah Twitter yeah I think I think in some parts there was a disbelief about around deploying no this is I don't like that's what you're asking it says exactly but and I think it still exists in the medical medical field for example well people asked for for interpretability explained ability and so on so forth yeah and even at the at the expense of better performance better performance but eventually that you know that it's so funny in in like 2005 I worked Yahoo moving models I'm like rule-based ranking systems to to boosted trees huh yes you know they had all the exact same complaints like the people were like oh these malls are not explainable they're like impossible to deploy like it's like exactly the same but now they're the same complaints about moving away from this issue yeah exactly well the difference for infrastructure is that we replicated almost the same API says when they as what Twitter heart for decision tree so it was a little bit easier there was already kind of ml ready right was in your case I guess it was like completely different right I'm strong yeah but it sounds like you had kind of add some weird components and sort of abstract away in the same way that oh yeah just obstructed everything away and made it very look the same basically you gain a gained adoption that was fun yeah so when did you move to Nvidia that was like your honor how to go 18 and so tell me about the stuff that you've been working on it Nvidia yeah so it's quite different in terms of the application domain however I'm basically managing the the team building the a platform to imagine the team building the platform to develop autonomous figure software so and in autonomous vehicle software I also include like deep neural networks right like all the validation required for it and so on so this is what I'm managing it's a pretty big endeavor the reason for that is autonomous vehicles are are such large scales so many there are so many people working on it and they're so I mean there are so many specific needs that we have to build relatively custom infrastructure in order to be able to you know to be efficient and good at it and in competitive and do you mean it's like custom infrastructure for self-driving cars or custom infrastructure for every individual team working on self-driving cars it's so it's it's the nature it's the nature of developing mudders and what we call perception so the ability to understand the world you know from 
the chaos of multiple sensors and so on. Developing this requires a lot of customization of the cloud infrastructure, basically, is what I'm saying. As an example, all machine learning teams use a workflow system in order to say, hey, I want to do this task, and then this task, and then another task, and so on. In the case of autonomous vehicles, the big difficulty is that the different steps are going to be in so many different languages and so many different libraries. One is going to be data preparation using Spark; one is going to be, oh, now I want to run the actual software from the car on the target hardware, the actual embedded hardware that's racked in the cloud; and I want to run this using CUDA; and then I want to run it in a container. All of these things are so different that they require a workflow system that's agnostic to all of this and that can be deployed anywhere. I'm bypassing the details, but basically in some aspects we have to develop our own customized infrastructure. Lukas: Got it. I guess you're starting to talk about this, but what are the big components of the infrastructure that you build, and what are the big problems at each component? Nicolas: At the top level, where we interact with our users, we provide tools and SDKs and libraries, and at the bottom level we have core components that everybody can leverage. At the top level, what we do is we go from everything outside the car: when people drive, drivers basically collect data or test a new build of the software system, then they take out the data or send it over Wi-Fi, and it gets into the system and needs to be ingested. So that's the first step, ingestion. Ingestion is already pretty complicated, because it's similar to Yahoo or Twitter in that it's write-heavy, and then you need somewhere to process the data once ingested and transform it into datasets that are more consumable by users. Those challenges are pretty massive, because we need to test for data quality, for example, we need to index the data, and we need to process or transform it into something that's easier to consume downstream. So that's the first step. The second step is to build the best datasets, and that's actually a big challenge. The way we approach it, and I'm sure you're familiar with this, is that we view machine learning as software 2.0, as Karpathy laid it out, I don't know if he was the first: data is the source code of machine learning, and so we need to be very careful about how we write our source code. To do that we're developing tools to curate datasets: to create datasets, select the right frames or the right videos with the right filters, and make sure there's no overlap between training and validation. So we have a lot of tooling for that. Lukas: And these tools, do they help a user pick this, or do they somehow automatically pick it? Nicolas: Both, actually. We're also investing a lot in active learning. Since you were at Figure Eight, I'm sure you have a lot of experience there. Lukas: Yeah, but I'm always fascinated by it. Nicolas: We published a blog post recently exactly about that. Autonomous vehicles lend themselves perfectly to active learning: there are massive amounts of data, but human labeling is very costly. If you want to do 3D cuboid labeling, it's so expensive, and yet there are thousands and thousands of hours of driving data available, so we really have to select the data that's going to be the most efficient and that's going to contain the patterns the deep neural network is not yet able to find. To do that we use active learning: active learning basically gives us confidence scores, and the frames or videos with the highest uncertainty are the ones we're going to want to label in order to improve the performance.
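A toy sketch of the uncertainty-based selection he is describing is below. This is a generic illustration with made-up scores, not NVIDIA's actual pipeline, which is described in their blog post.

```python
# Toy sketch of active learning by uncertainty sampling over unlabeled frames.
# Scores are made up; in practice they would come from the current detector.
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_classes = 10_000, 4  # e.g. pedestrian, bicycle, car, background

# Stand-in for per-frame class probabilities from the current model.
logits = rng.normal(size=(n_frames, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# Entropy as the uncertainty measure: high entropy means the model is unsure.
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

# Send the most uncertain frames to the (expensive) human labelers.
labeling_budget = 200
to_label = np.argsort(-entropy)[:labeling_budget]

print(f"Selected {len(to_label)} frames, mean entropy "
      f"{entropy[to_label].mean():.3f} vs overall {entropy.mean():.3f}")
```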
Nicolas: We tried it, and we got something like 3x higher improvement, three to five times higher improvement actually, using active-learning-selected data versus manually curated data. Not even random: manually curated by humans, because humans are guessing what data is going to be the best. The challenge was, let's find vulnerable road users at night, at nighttime. It's a super challenging problem, because of course it's difficult for cars to see at night with the camera, and pedestrians and bicycles are two categories that are difficult to detect anyway. So the idea was to detect these better. The first group was manual curation: a group of people who were told, look through the videos and find images that are relevant for these classes. The other group was just using the models for these specific classes to find the frames with the highest uncertainty for those two classes, so that one was completely automated. We were able to find frames that were very uncertain for pedestrian and bicycle, maybe twenty thousand of them, and then the manual curation did the same, twenty thousand. What they usually do is swipe through videos, and when they find pedestrians or bicycles at night they stop and select a few frames in that segment of video. Then we trained models with these two sub-datasets and looked at the validation performance, and the increase in validation performance was three times higher for the active-learning-selected data. So it does work, and it can be completely automated, if you think about it. Lukas: Well, that's really impressive. That's amazing, and that's in your blog post? We should definitely get a link to that. Nicolas: Yeah, we were impressed too, really, because that was just an experiment, a research experiment, and now we're working to automate it, to be able to automatically select data, retrain models, and improve performance, so we could have a machine doing this on its own, with a human in the loop just for the labeling. Lukas: Okay, what else is there? So there's this data collection piece... Nicolas: Then there's labeling, which I'm sure you're very familiar with, but labeling for autonomous vehicles comes with some pretty massive changes. First, the scale is massive: NVIDIA has a thousand-plus labelers in India, and we're building the software ourselves to actually be able to dispatch requests to those labelers and manage it, which, as you probably know, is quite complex, because it has to deal with human workforces and the way they behave, so when they refuse, when they make
mistakes, and so on, so we have to integrate quality assessment in the loop. That's one part. But the tools themselves, the UI tools, are also pretty tricky. For example, we sometimes need to be able to draw things like polygons, and make sure we can link lidar data with image data, so we need a mix of human labeling and automated computation to be able to link those two things, or to build a new representation of the data that is then usable by those humans. Doing the two at the same time is pretty complex and difficult, so we built all that. That's step number three, basically, enabling the data. Then step number four is about training. We've developed a lot of code to enable all of our developers to train their models. One of the biggest challenges we have is that once we train, we need to export this and run inference on an embedded system, so we are compute constrained in a way. This is one of the constraints: we cannot deploy a thousand servers to crunch everything, we need to use a single chip to compute everything. So in order to make that work, we use multitask training, for example, where we have one single model with a shared backbone that can predict many things, like path detection, or obstacle detection, or lights, signs, intersections, and so on. I think this is similar to what Tesla is doing; I know they've given a talk recently about that. And then there's a lot of optimization, such as pruning the models, or INT8 quantization, or other techniques that we can use to further reduce the size of the model with equal performance. Lukas: And your tools do all of this? Is there stuff left for a perception team at a customer to do, or how do you think about that? Nicolas: We provide this as part of the core libraries, but of course sometimes they need to do something new, and when they need to do that they can add their own algorithm and so on, and then we basically productionize it into the platform. What they're really focused on on the perception side is not really those features; I mean, they are looking into it, that's very important, but they're also looking a lot into new types of predictions. For example, they were detecting bicycles, and now they want to predict more fine-grained things, so they're going to have more classes. They do a lot of those things that we don't have to care about; we just provide the core infrastructure. Lukas: So you provide core infrastructure to do multitask learning and quantization and that kind of thing, but then the customer provides the different types of classifications that they want? Nicolas: Yeah, exactly, but of course there's some overlap between the two and we help each other. Lukas: Do you handle some of the newer stuff, like trying to figure out intentions, or trying to map out the underlying dynamics of a person, like where their arms and head are? Is that within your scope? Nicolas: Not my team specifically, but the perception team is definitely looking into things like that. That's usually more on the research side, a bit more advanced, but yes, definitely.
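Here is a generic sketch of the multitask idea: one shared backbone computed once per image, with several cheap task-specific heads, so a single network can serve multiple perception tasks on one embedded chip. The head names and sizes are invented for illustration and this is not NVIDIA's actual architecture.

```python
# Generic sketch of multitask training with one shared backbone and several
# prediction heads. Head names and sizes are invented for illustration.
import torch
import torch.nn as nn

class MultiTaskPerceptionNet(nn.Module):
    def __init__(self, num_obstacle_classes: int = 4, num_sign_classes: int = 8):
        super().__init__()
        # Shared feature extractor (the expensive part, computed once per image).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Cheap task-specific heads sharing the same features.
        self.obstacle_head = nn.Linear(64, num_obstacle_classes)
        self.sign_head = nn.Linear(64, num_sign_classes)
        self.path_head = nn.Linear(64, 2)  # e.g. lateral offset and heading

    def forward(self, images: torch.Tensor) -> dict:
        features = self.backbone(images)
        return {
            "obstacles": self.obstacle_head(features),
            "signs": self.sign_head(features),
            "path": self.path_head(features),
        }

model = MultiTaskPerceptionNet()
out = model(torch.randn(2, 3, 128, 128))
# A combined training loss would typically be a weighted sum over per-task losses.
print({k: tuple(v.shape) for k, v in out.items()})
```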
Lukas: Okay, and does this work with different ML frameworks, or is it at a lower level? How does that work? Nicolas: No, we do work with the ML frameworks, just because they provide so much value. For training specifically we use TensorFlow a lot, PyTorch a little bit too; I think that's mostly historical. For deployment, though, we use TensorRT, which is NVIDIA's deep learning inference library. What's great is that it's really optimized for NVIDIA hardware, and it's also optimized for inference, so you can do some optimization of the graph, for example, and we get pretty big performance gains with that. Lukas: Cool. Wait, is that the whole thing? So you have data collection... Nicolas: No, that's not all. There's evaluation, the evaluation of the models. Let's say you train one model, say for car detection. What you really want is to understand, if you modify that model and its parameters, how that is going to impact the overall system. That's very tricky, and it requires a pretty fine-grained understanding of the impact. Let's say we have this perception system that's a mix of post-processing, Kalman filters, and neural networks, all mixed in pretty complex ways. What we want to do is have multiple levels of KPIs, and a pretty large range of KPIs, to understand what's happening in this system. That's step number one: so, for example, false positives, false negatives, that kind of thing. The next level, at the perception API level, is, how many mistakes do I make per hour, for example, in the detection of a car. And I also want to understand, even further than that, how do I drive the car, which involves simulation in that case, so we want to be able to run simulation jobs with this new perception system to understand how the system behaves. We want to do all of this, so we have a system to evaluate all of these things at scale, together, on the same infrastructure: the same data structures, the same kind of output data, the same analytics library, and so on. The output of this is all these KPIs, plus what we call events; a false positive is an event, for example, and you can define an event as almost anything. Once we have all of this, the AV developers can look at all this information, and this is the next step, which we call debugging, which is also software 2.0 style: debugging the output of a predictor. We look at the output of the predictor, we can look at the KPIs, we can look at all the events, and then zoom in on those events and look at them in a very fine-grained way, at the prediction versus the ground truth, for example, or see if there's something missing when there's a lens flare. So we can go very deep, then come back up high, and then make a diagnosis about what's going wrong with the system. That diagnosis is how we improve the system; it tells us, I need more data on, say, Japan at night, and then we can go back to the curation step, which is building better datasets. That's the feedback loop that goes from debugging back to the curation step, and that helps us improve our perception system. Lukas: And so your user could basically automatically request, like, give me more
Lukas: And so your user could basically automatically request, 'Give me more bicyclists in the snow,' or something, and then the curation step goes out and looks for more of that, or weights it more heavily? Nicolas: Yeah, exactly, based on what the curation can do, which could be geographical conditions, or maybe temperature if we have access to that type of sensor. But yeah, definitely. Lukas: That's amazing. I'm trying to put myself in your shoes: how do you approach building such a sophisticated system on behalf of customers? Do you build your own perception systems just to try your own software? How do you think about that? Nicolas: At the bottom of this workflow, we have core components, basically, which are our data platform and our workflow management system, and those two things are powering everything: to be able to write ETLs, to be able to register datasets, to be able to perform queries on the data, and to make sure that all of these things are traceable end-to-end, which is a major requirement for the autonomous vehicle industry, so that if there is a problem in the future, we can go back in time and understand everything that happened. Anyway, all those things at the bottom are powering the top layer, and they are pretty beefy and made for scale. Lukas: And sorry, the first thing is data storage? Nicolas: It's a data platform, and then a workflow management system. Lukas: A workflow management system, gotcha. And the data platform, is it just keeping track of where all the data is, or...? Nicolas: It's a bit more than that. Basically, it's all the infrastructure required to store structured and unstructured data. Structured data could be anything, like simple floating-point continuous values, and raw data is all the sensor recordings. We have all of this and we can organize it. The second step is that we want to be able to query all of this at scale, so basically we use Spark SQL, and Spark in general, to enable us to do all of this. That's what the data platform provides, all those pieces. The workflow management is more about the ability to schedule those complex compute and data-access tasks and stitch them together. So basically, we know we can organize data in a certain way, we know we can access a cluster, but then we need to make sure that, like I explained, we can perform those graphs of tasks. And sometimes we require a lot of scale, when we do evaluation at scale, for example, on thousands of hours of data, so we need a workflow system that enables us to do all of this.
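As a rough sketch of what a curation query against such a data platform could look like, here is a hedged PySpark example. The bucket paths, table name, and column names are hypothetical; they only illustrate the "give me frames matching these conditions" pattern described above.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("curation-example").getOrCreate()

# Hypothetical table of per-frame metadata extracted from drive logs.
frames = spark.read.parquet("s3://example-bucket/drive-logs/frame_metadata/")
frames.createOrReplaceTempView("frames")

# "I need more data on Japan at night": query the structured metadata at scale.
candidates = spark.sql("""
    SELECT frame_id, log_id, recorded_at
    FROM frames
    WHERE country = 'JP'
      AND sun_elevation_deg < 0      -- night-time frames
      AND weather = 'snow'
    ORDER BY recorded_at DESC
    LIMIT 10000
""")

# The selected frames would then feed the labeling and training steps.
candidates.write.mode("overwrite").parquet("s3://example-bucket/curation/jp_night_snow/")
```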
Lukas: Interesting. I guess one thing I didn't hear you say, that I think a lot of people talk about, is synthetic data. Is that interesting? Nicolas: It is for us. We have a simulation team, basically; I think I mentioned it, for testing. Lukas: I wasn't sure if that was totally synthetic simulation or... Nicolas: Well, we can do open loop, which is with no control and planning in the loop, no actual driving. There we can replay existing data, and that's really good, because then we can measure on real data. But for closed loop, which is really driving in a world, we need simulated data, and this is where NVIDIA kind of shines, because we can of course generate simulated worlds, like in video games. And even more than that, we have the ability to generate all kinds of sensor data for the car, so not only images but also lidar data, radar data, camera and IMU data, all those things that are very specific. We can generate all of this, and we have a special box that we call Constellation, which has this simulation generator on one side and what we call the ECU, the embedded system, on the other side, which can process all those sensor inputs in the same box. So basically you get exact simulation and exact processing of the simulated data. We can do all of this, and we can use it for testing, and we can also use it, of course, for collecting data and training on data that just doesn't exist in the real world, for example. So it's very helpful for bootstrapping perception efforts, for example bootstrapping new neural networks. Lukas: Right, right. It sounds like you have almost a complete end-to-end solution for perception. Could I take it, with a car and some sensors, and get a system that could make an autonomous vehicle for me? Nicolas: Yeah, that's exactly it, yes, you can, except that... I mean, it's difficult to change sensors, as you can imagine, because if suddenly we use a different sensor, we have to recollect data, revalidate and retrain models, or fine-tune them, and so on and so forth. But given all the tooling that we have for all the jobs we need to run, we can redo that work entirely, yes. Lukas: So I guess you're the perfect person to ask: what do you think is left to do? I mean, I live in San Francisco, so I actually do see autonomous vehicles driving around a fair amount, but what pieces do you think are left to really work on to make it a real thing that I would use every day? Nicolas: So you don't use it every day, is what you're saying? Why? Lukas: I actually... well, I feel like I'm in the industry. Nicolas: Oh, I see. Do you have a Tesla, like a Model 3, or anything? Lukas: No. I mean, I've played with them, and I think they're very, very impressive. So I guess maybe that's a good point you're making. But what do you think are the next steps with these systems? What are you thinking of focusing on? Nicolas: Honestly, I think, first, this is going to be pervasive, and in the future everyone is going to have autonomous vehicle functionality. That's number one. And I think the vision goes even further than that: cars are on the way to becoming software-defined. People are going to see a centralized computer with a really nice UI/UX, and they're going to be able to buy new software, potentially through upgrades. This is already what's happening with Tesla, and I think it's really one of the reasons why their capitalization, their valuation, is so high. But the other carmakers are also looking into that model, and they're interested in it, and I think this is the future of the industry. So for us, we need to be ready for this world at NVIDIA. That means having a programmable platform, an open platform, because we want to enable all those carmakers, or whoever wants to build those systems, to build them together on the same chip, on the centralized
computer. We don't want to exclude them from that, basically. We want to enable people to write software on our chip: infotainment software, self-driving software, and so forth. And now, self-driving is so difficult that we can provide it for them as a given application, for big carmakers or even smaller ones, and just develop it as an application on the platform. And then, where are we going with that? I think it's a matter of performance, and improving the control, the planning, and the entire perception system. I think we're all still at the beginning of it, and we're going to be able to do better and better and better over time, by building a lot of automation first, and potentially adding machine learning in areas that don't have machine learning yet, such as predicting the planning path, for example. Things like that. Anyway, that's pretty much it. Lukas: I guess what I'm hearing is that you think there's iterative improvement in a bunch of different things, and then applying machine learning to planning is the big next step in making these systems work better. I'm curious, what do you think of the stuff that people are really wrestling with right now to make autonomous vehicles really work? Nicolas: I think the hardest thing is urban areas right now. Being able to drive in urban areas, in New York City for example, is really, really hard, and that is the next frontier. It requires all sorts of new signals coming from the car, for example around intersections, or a lack of lanes, things like that that can be very tricky, or unknown kinds of vehicles, such as garbage collection trucks, stuff like that. All of this is still a little bit newer, and most self-driving providers started with easier areas such as highways, except for some of the Level 5 ones, like Lyft, the ones that are trying to leapfrog that already. But that's a big challenge. Lukas: Yeah, that's the next one, I guess. Do you think your approach at NVIDIA is significantly different from, say, Tesla or Lyft? How do you think about that? Nicolas: Yeah, I mean, all these companies are targeting different things, so as a result there are some differences. Lyft is targeting Level 5; they want to have fully autonomous vehicles. Tesla is building cars, so they don't need to build a platform that's usable by other people, for example. At NVIDIA we are building a platform, and we make money with our hardware. We can also make money with our software, but all the software has to be usable by everyone else, so we have to make it in a generic way. This is one of the constraints we have. As a result, for example, the platform we're building, the core infrastructure, is designed in a way that it can be adopted by other carmakers, or really anyone developing self-driving. Lukas: It's interesting, because all the pieces of the platform that you mentioned seem super relevant to healthcare applications, or almost any kind of deep learning application. How do you think about that? Would you ever expand your platform to other applications? Nicolas: I think that's possible. Some of these pieces are not required per se, and sometimes the scale we aim for is not required either, for healthcare scale, for example. However,
usually what we try to do, what we've built, is like a superset of those tools, and we push the frontier a little bit further. Now, some things are a little bit tied to autonomous vehicles, but the entire end-to-end workflow, though, seems very applicable, you're right. The tools themselves, sometimes they're customized to the data we have, so yes, we could extend them; it would just require some work, basically. Lukas: I'm curious, and this is kind of a specific question, but I've been thinking about this lately: how important do you think the hyperparameter search piece is, like the neural architecture search you were talking about? Is that really essential, or is it a nice-to-have? Nicolas: So, neural architecture search is still something we're exploring. I think it can be important, because we can really reduce the compute footprint. For example, we can constrain the search space of the neural architecture search to something that's going to perform really well on the target hardware in terms of latency, because NVIDIA has some hardware accelerators that are specific, so we can make sure that we find architectures that take advantage of them, and so on and so forth. Hyperparameter search, however, is something that we have available, but the advantage of using it versus the compute it requires is often not super interesting for developers. So we do it sometimes, but it's not really a big advantage, a competitive advantage I would say, for us. Lukas: Actually, is there a piece of your platform, which sounds amazing and solves a whole slew of problems, that you're especially proud of, that you think really stands out as best in class? Nicolas: I really like the active learning part and everything that goes around it. One other thing we are doing is a kind of targeted learning, which is the ability to take perception bugs, like 'I'm not able to detect trucks in that position,' and then use that to sample a dataset that is then going to be used for training and to fix the perception bug. Doing that is similar to active learning, but conditional: conditional active learning. I'm really proud of these two things, because I really love the automation of it all. We could just go on vacation and the system would keep working: we have customers sending their bugs, and automatically we just fix them. Lukas: Cool. Well, this is so fascinating. Honestly, even if we weren't recording this, I would love talking about exactly this. It's great to meet you, thanks so much. Nicolas: That's cool. Well, thanks a lot.
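A tiny sketch of the conditional, bug-driven selection idea: given embeddings of frames where the perception system failed, rank a large unlabeled pool by similarity and pull the closest frames into a new training set. The embedding dimension, data, and budget are made up; this is not NVIDIA's implementation.

```python
import numpy as np

def select_frames_like_failures(failure_embeddings, pool_embeddings, pool_ids, budget=500):
    """Rank an unlabeled pool by cosine similarity to known failure cases."""
    f = failure_embeddings / np.linalg.norm(failure_embeddings, axis=1, keepdims=True)
    p = pool_embeddings / np.linalg.norm(pool_embeddings, axis=1, keepdims=True)
    # For each pool frame, keep its best similarity to any failure case
    best_sim = (p @ f.T).max(axis=1)
    top = np.argsort(-best_sim)[:budget]
    return [pool_ids[i] for i in top]

# Made-up example: 512-d embeddings from some feature extractor
failures = np.random.randn(20, 512)       # e.g. missed trucks at a certain angle
pool = np.random.randn(100_000, 512)      # unlabeled recorded frames
ids = [f"frame_{i}" for i in range(len(pool))]
to_label = select_frames_like_failures(failures, pool, ids)
```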
Lavanya: All right, that was such a great conversation. Thank you, Lukas and Nicolas. I'm going to add a link to Nicolas's Twitter in the show notes below, and I would highly recommend that you check him out. Also, if you'd like to continue the conversation, we do have a very active Slack community with over a thousand machine learning engineers, and I would love to see you on there. Finally, before we go, I would love to talk to you about something that I'm super excited about. Lately we've been working with a lot of self-driving car companies at Weights & Biases, and that means we've been building native support specifically for self-driving machine learning models. So now, with just a few lines of code, you can do object detection with 2D and 3D bounding boxes with Weights & Biases. You can also do semantic segmentation, so you can compare your model's predictions with the true labels inside your dataset. And finally, my favorite: you can now log point clouds with Weights & Biases, so that means you can use point clouds to understand your scene, with custom annotation layers. You might use this with something like a dataset of lidar points. For example, take a self-driving car dataset composed of lidar points: you could plop that into Weights & Biases and draw nice little 3D bounding boxes around the cars, people, and other objects within your scene. I'm going to leave some links in the show notes below so you can try out point clouds, semantic segmentation, and also object detection. It's a really fun time, and whether you're working on self-driving professionally or just for fun, I would love for you to try it and tell us what you think. Finally, you can also use Weights & Biases to run sweeps to tune your hyperparameters in a very organized way. This means that you can just give us a list of hyperparameters that you would like to search through, and also a search strategy, and then we will go through and train all these different models and find you the best one, in a very organized way that is very low effort on your part. I'll also leave some links down below for you to try out our sweeps. That's all for today. We'll see you in the next episode. Have a nice day,8335
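For readers who want to try the logging described above, here is a hedged sketch using the wandb Python library. The project name and data are placeholders, and the exact bounding-box and point-cloud schemas may differ slightly between library versions, so treat this as a starting point rather than a definitive reference.

```python
import numpy as np
import wandb

run = wandb.init(project="self-driving-logging-demo")  # hypothetical project name

# 2D object detection: an image with a predicted bounding box.
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
boxes = {
    "predictions": {
        "box_data": [{
            "position": {"minX": 0.1, "minY": 0.2, "maxX": 0.4, "maxY": 0.7},
            "class_id": 0,
            "box_caption": "car",
        }],
        "class_labels": {0: "car", 1: "pedestrian"},
    }
}
run.log({"camera": wandb.Image(image, boxes=boxes)})

# Point clouds: log lidar points (x, y, z) so the scene can be inspected in 3D.
points = np.random.uniform(-50, 50, size=(2048, 3))
run.log({"lidar": wandb.Object3D(points)})
run.finish()

# Sweeps: describe the hyperparameters and a search strategy, then launch agents.
sweep_config = {
    "method": "random",
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [16, 32, 64]},
    },
}
sweep_id = wandb.sweep(sweep_config, project="self-driving-logging-demo")
# wandb.agent(sweep_id, function=train_fn)  # train_fn would be your training function
```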
+Brandon Rohrer — Machine Learning in Production for Robots,https://www.youtube.com/watch?v=_Ot35PspXw4,2071,2020-03-10,Lukas: Hi, I'm Lukas, and you're listening to Gradient Dissent. We started this program because we're super passionate about making machine learning work in the real world, by any means necessary. One of the things that we discovered in the process of building machine learning tools is that our users and our customers have a lot of information in their heads that's not publicly available, and a lot of people that we talk to ask us what other people are talking about, what other people are doing, and what the best practices are. So we wanted to make all the interesting conversations that we're having, and all the interesting things that we're learning, available for everyone out there. I hope you enjoy this. Brandon Rohrer is a mechanical engineer turned machine learning engineer and data scientist. He's worked on some incredible robotics projects, then worked on data science projects at Facebook and Microsoft, and currently he's a principal data scientist at iRobot. At the same time, he's an instructor at End-to-End Machine Learning, where he's made some amazing videos on convolutional neural nets and other things. I'm super excited to talk to him. Brandon, it's really nice to talk to you, and thanks for taking the time. It sounds like you've worked on machine learning at quite a range of different companies, most recently iRobot, so I can't help myself: I'd love to just hear about what kinds of challenges you have at iRobot, and in robotics in general. Brandon: Thanks for having me on, Lukas and Lavanya, I really appreciate it. At iRobot, we get to actually support these little giant frisbees that run around on people's floors and suck up dirt; those are the vacuum cleaners. And also there are the mops, which run around on people's hard floors and clean up messes. What's really fun about this is, if you think about production machine learning systems having to deal with whatever input or badly formed requests you might encounter, imagine taking that to the physical world. Then you have something that is bopping around literally every type of home in the world. There are 30 million of these things out there now, and as hard as we try to imagine, we can't imagine all of the challenges that they will come up against. So this is really fun from an algorithm design standpoint, and just an engineering standpoint: making something that can get beat on, can have cats ride on it, can run into all kinds of things, encounter cords and socks and Legos and Skittles. How is it going to handle all of these things? That, to me, is really fun. It's the polar opposite of the sandboxed, carefully prescribed problem where you know exactly what your data is beforehand, and you know it's been cleaned up, so it's going to give you a good, high-quality answer on the other end. Lukas: I'm curious: I feel like I've had a lot of friends in the last couple of years move from consumer internet ML applications to robotics. Was it a big adjustment for you? Were there big changes, or was it mostly kind of the same set of issues? Brandon: Let's see, definitely changes, but for me this was coming home. Robots for me is where I started, right: my degrees are in mechanical engineering, and my graduate work was all about using robots to rehabilitate stroke patients. So knowing that things could break all the time, and not to trust your sensors, that's kind of what I grew up with. Then, when data science became a more common career path, I rebranded myself as a data scientist and went to agriculture, went to Microsoft doing cloud machine learning solutions for a variety of different companies, and went to Facebook infrastructure, which is a fascinating set of problems around keeping one of the biggest networks and sets of data centers in the world up and going and running efficiently. What I enjoy about all of these is that you could not ignore where the data came from. You had to know something about the people, what state they were in when they generated it; you had to know what pre-processing it had; you had to know about the assumptions that were made along the way. If you didn't know this, then you couldn't build good models to get answers out of it. Robots just take this and put it front and center, and take it to the extreme, because if you don't know what a given sensor value means in the physical world, it's really hard to build a good model around it and know how to interpret it. Naive models where you just throw things in, unless you get really lucky, just don't work well. Lukas: So does that lead you to simpler models? Are you more afraid of complexity for these applications? Brandon: My personal strategy when faced with something like that, when I don't know everything it's going to come up against, is that the biggest thing I want to make sure of is that when it performs poorly, it doesn't do horrendously badly. Lukas: Well, so how do you do that? Tell me about that. Brandon: Simplicity is a good one, so you know exactly what happens. One example of this is in agriculture. One of the hard modeling problems is: I have a field, I have some corn seed, I planted it on a certain day, I used this much fertilizer, here's what the weather and the precipitation were all season; how much
am I going to harvest at the end of the year? If you had a model that could spit that out, you would have solved agriculture, or at least the yield problem in agriculture. But there are so many variables. The model we were working with was a popular academic model that had literally hundreds of variables in it, and there's no way that you have enough data to train that model well. What was really funny is, when we did some analyses on that, using popular settings for that model, you could take a really naive model which just estimated a flat rate for all fields everywhere, for all conditions, and compare it with this really elaborate, several-hundred-parameter model, and the flat-rate model did like twice as well. Lukas: Funny, because I would think with plants there would be a lot of complicated interactions. But I guess you just don't have enough data to know. Brandon: You're exactly right, and that's why the many-parameter model didn't do so well there. It did account for a lot of interactions, but to get them to work the way they were supposed to, you had to get all those parameter values correct, or in the neighborhood, and we just didn't have enough information to do that. So an alternative approach is to start with a very dumb estimate and then incrementally make it a little smarter. By the time I was done, I was working with something like a three-parameter model: one for precipitation, one for soil texture, and one for something else, and I could check it each time and really just listen to the data. Lukas: So the same holds true for robots, or for anything else: simple is good. Is there anything else, though, that you do to make sure that your models are robust in the face of different types of data? It sounds like you're also maybe pulling apart the problem into subproblems. Brandon: Definitely. Some problems are amenable to that, and to the extent you can separate a problem into subproblems, that is a great strategy. Lukas: Although not everyone thinks that, right? I mean, like I said, sometimes people talk about end-to-end autonomy, so I do think that's a little bit of a point of view. Brandon: That's true, that's a good footnote. I will say that that is my opinion; I don't think it's universal. But it is easy to get overambitious, to fall in love with your model and say it explains potentially so many phenomena, it must be right, if we can just get the data to train it correctly. But then, in actual experience, and we've seen this even with some vision models, the high-quality labeled data to train it well would cost so much to gather that it's impractical. In that case, the model doesn't do much, and if you close your eyes to that and move ahead with poorly labeled data, then badness happens, and you get models that are worse than no model at all.
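The flat-rate-versus-hundreds-of-parameters anecdote is easy to reproduce in miniature. Below is a synthetic sketch (made-up data, not the agricultural model Brandon describes) showing how a trivial mean-value baseline can beat a heavily parameterized model when the dataset is small:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))                # 40 "fields", 30 measured variables
y = 5.0 + 0.3 * X[:, 0] + rng.normal(scale=1.0, size=40)   # mostly noise

flat_rate = DummyRegressor(strategy="mean")                          # same estimate everywhere
elaborate = make_pipeline(PolynomialFeatures(2), Ridge(alpha=1e-6))  # hundreds of terms

for name, model in [("flat rate", flat_rate), ("many-parameter", elaborate)]:
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated MSE = {mse:.2f}")
# With this little data, the flat-rate baseline usually wins by a wide margin.
```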
Lukas: Yeah, I've definitely experienced that. I'm kind of curious, and it's funny: I've developed a real interest later in life in mechanical engineering and electrical engineering, and I feel like you've kind of gone in the opposite direction. One similarity that I've found between mechanical problems and machine learning is that you don't get good error messages. Do you think that your background in mechanical engineering has helped you in certain ways in machine learning? How did you go about learning a new field, because lots of people wouldn't do it, and how did you bring the knowledge you had to help you there? Brandon: In my case, it was motivated by a problem I was trying to solve. In my work, we used robots to help rehabilitate stroke patients, and we saw changes in their movements. A good research question is: what's going on in the brain to make their movements get smoother? What's going on there? The more you dig into human movement, a whole collection of questions bubbles up. How does the brain control this hardware that's sloppy, that changes over time, that's not very accurate compared to precision machine-tool robots, and with huge time delays, time delays that would take any off-the-shelf robot and drive it unstable? But the brain does it casually. We do it when we're half freezing and our neuromuscular dynamics all change: the brain compensates. We're drunk and all of the time delays change: the brain compensates. How does this happen? So this was the problem that I wanted to solve: how could I make something that could control a piece of hardware that I didn't know or understand very well, and that was going to change all over the place? That led me down the path of learning what I could about how the brain works, which, from the point of view of wanting to turn it into an algorithm, still has huge gaps. Even though we call them neural networks, neural networks have no resemblance at all to anything that goes on in the brain. So I studied that, and figured out what I wanted it to do if it was successful, and that did lead me into studying different signal processing methods, different algorithms, different families of algorithms. And then, once I started working as a data scientist for my day job, aside from this research interest there was a whole other professional motivation to dig into these things. Once I started writing tutorials and teaching them, there was even more motivation to learn these things and be able to understand them well. So it kind of built on itself. That original problem, of building a general-purpose brain or controller that you could pop into any robot so it could learn what to do with it, is still a long-term passion of mine. It's my personal 30-year project. Lukas: What makes that hard? Brandon: The real world. The robot is never going to experience the same thing twice; you're never going to get exactly the same camera image two times in a row. So one thing that's hard is that you have to deal with always-new experiences. You can never learn exactly what to do in each situation; you have to learn what other situations are similar. That sameness is very hard, and we humans do it so well that it almost makes it harder to put down into code. Lukas: You could say that about image processing or audio processing, and I feel like when you look at the progress in terms of facial recognition or understanding voice, it's really spectacular, right? We see it in our lives all the time, for better or worse. But I feel like we don't see robots running around like we might have expected, and the feats that robots do that I'm kind of wired to be impressed by are actually incredibly unimpressive to, like, my mother. Whereas in every other field, the stuff that ML is doing is sort of amazing compared to humans, in robotics we look at what a Roomba does and it kind of blows our mind,
but, you know, my cat can do that. Brandon: But impressive! Yeah, less helpful, kind of creating a mess instead of undoing it, but still. Lukas: But why is robotics so particularly hard? Brandon: So, the similarity problem, once you get past something concrete like 'this face belongs to this person' in different situations. Here's an example: I'm in a city I've never been in before; if I had to guess which way the hospital is, how would I do that? We have a lot of subtle things that we do to get oriented in novel situations. That's one aspect. Another is that machine learning, the way it's set up right now, requires just a whole lot of data. To learn basic things, categorizing images, something that we train animals to do on a regular basis, not even very brilliant animals, requires huge amounts of well-labeled data, and if you get some poorly labeled data in there, you can mess it all up. To learn to do the thing where you've maybe never done it before and you have to make a reasonable guess your first time: there are some people making efforts in that direction, but it's still very early days. Lukas: What do you think about the approach of simulating data? Does that seem promising to you? Brandon: Yes. In fact, another thing that's really hard about doing robots is that it's hard to keep your robot up and going. If you've ever worked in a robotics lab: if you get three solid runs, it's like, great, write that up, that's your thesis, you're done, before the next thing breaks. Simulation is really useful for that, because you can run a robot for thousands of years in simulation time and generate that volume of training data. It's not without its pitfalls, though. I've worked with simulations, and if you talk to anyone who has, they'll have a story about how there was some quirk in the simulated physics or the simulated world, and their reinforcement learning agent learned to take advantage of it. In mine, there was a seven-degree-of-freedom robot arm, and it learned to reach down into the table, because I had made the table too soft, and use the table as a guide, and then come up to pick things up. Lukas: So good. Brandon: There is a paper that came out not too long ago about OpenAI using a Shadow robot hand, with a whole lot of robot simulation, to do some of the steps to solve a Rubik's Cube, and a good part of what they did was getting the simulation right. In fact, if you read closely, they actually went back and modified their experiment and their physical hardware in order to make it simulatable. It does not come easy, but potentially it's a really useful thing. Lukas: What is the toughest part of going from a model on your laptop to something in production? Brandon: When I'm working with data on my laptop, I run it, it all fits in RAM, it gets an answer, spits it out, makes an image, saves it to a file. That's all good: I get an acceptable error rate, I'm good to go. Taking it into production, that trained model then becomes the easiest part of the whole thing, depending on what you're using it for. Let's imagine you're using it as part of an app or part of a service, where somebody somewhere on their phone or on their laptop has to do something that needs the result of this; let's say it's a weather predictor, or a coronavirus risk predictor, or something like that. All of the pieces to get that request in, to make sure that it's not part of some denial-of-service attack, to make sure that the request is well-formed and it's not going to
gum up your model, to get the answer out, to make sure that it gets delivered: if you break all of those pieces down individually, they're fairly simple. You put them on a piece of paper and it looks like a bunch of blocks connected by arrows, and it's like, oh, okay, here's what all the things do, that's great. I don't do this myself, but I sit next to people at work who spend their days making sure that all of these blocks run smoothly and all of these arrows are working the way they're supposed to, and it is a full-time preoccupation, a full-time demand on your attention, to care for and feed these things. They are all running on computers that are in a data center somewhere; they're all running on software that's being regularly updated. Anyone who uses Amazon Web Services is probably familiar with new services and new capabilities coming all the time; occasionally there are breaking changes, so what worked yesterday doesn't work today. It is a lot harder than it sounds when someone tweets out, 'Oh cool, I spun up a cluster and now this thing's running a thousand times faster.' That's super cool, but it hides a lot of effort that goes on underneath, and it also hides a lot of the long-term investment required to keep that up and going. Lukas: I guess, though, I totally agree, but the things that you've described feel like the difference between any kind of demo on your laptop and any kind of production thing. I do feel like there's at least a trope or a meme about how machine learning is particularly hard to do this with. Do you think there's something special about machine learning that makes it extra hard to put the stuff around it and make it stable, or do you think people just get too excited about demos in general? Brandon: Yes and yes. The machine-learning-specific issues are: it's almost impossible to consider all the possible inputs you'll get. For instance, if you want to take an image as an input and, say, put a filter on it or do some kind of identification on it, it's very possible, but basically your users are now adversarial, and some people out there are going to, either intentionally or by accident, come up with things that will break what you're doing. So you need to be able to identify that. It may not take the service down, but it might produce a result that's undesirable or offensive or at least embarrassing, so you have to keep an eye on that. The other thing is that a lot of times, when we train a machine learning model, there's a training set, a validation set, maybe a test set; you train it, you get good results on that data, and you go. That assumes that the world doesn't change, which is a terrible assumption, because as soon as you deploy that model, whatever phenomenon you're modeling is going to start gradually shifting. A great example of this is weather: if you had a really good weather predictor in 1970, it would probably not be worth very much today. So having the ability to not have a static model, to periodically retrain and redeploy, is important if you want to keep it up and going. Those are the two big ways I've seen machine learning models in particular trip people up.
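The "world keeps shifting" point lends itself to a simple monitoring sketch: compare the distribution of an input feature at training time against recent production traffic and flag drift. The test choice, threshold, and data here are made-up policy for illustration, not a prescription.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov check between training-time and live data."""
    _, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Made-up example: temperatures seen during training vs. this week's traffic.
train_temps = np.random.normal(15.0, 8.0, size=5000)
recent_temps = np.random.normal(19.0, 8.0, size=2000)   # the phenomenon has shifted

if drifted(train_temps, recent_temps):
    print("Input distribution shifted; schedule a retrain and redeploy.")
```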
Lukas: OK, you wrote yourself a really, really softball question, but I'm kind of dying to know how you're going to answer it, so here it goes: Brandon, what's so great about robots? Brandon: I alluded to this already, but there's more. For me personally, there is passion around robots. When I was a kid, I was five years old, and I'm watching The Empire Strikes Back in the theater, and Luke, at the very end, he gets his hand cut off, and of course he ends up with this prosthetic hand, and there are these mechanical actuators in place of tendons, and I just thought that was the coolest thing I'd ever seen. That right there was going to set my career path. So now I'm in graduate school in mechanical engineering, working with prostheses and stroke rehabilitation, and I have circuit boards out, and I'm on the phone with my dad, who's an analog electrical engineer, saying, hey, I need to build a preamplifier for this signal because the sensor is not strong enough. And I have to package it up and shrink-wrap it and tape it to this motorized prosthesis that we're putting together, and then I have to read it in through the serial port and write some C code to pull the values off of it. It's right down at the electrons; it's the interface between the physical and the software world, and it's maddening and frustrating, and so many times it doesn't work, and then after weeks it does, and I just stand back and look at it. There's no boundary between the real world and the imaginary world and the computers; it's all real. There's physical and there's digital, but there's a blurry line in between, and robots span this. So with robots, you embrace all of the chaos of this physical world, and you really have to put your money where your mouth is with regards to control and learning. You know that your sensors are going to fail, you know that your actuators are going to change performance over time, and you have to be able to handle all of this stuff. And when you're done, if you do it right, you have a little thing that, if you suspend disbelief, looks like it might be alive in some small, cool way. That's pretty cool; that gets my motor going. Lukas: OK, so I've got to ask you. One thing I think about is, it feels like we don't actually engage with as many robots in our lives today, but I don't think it's so much the cost of the materials. I mean, iRobot maybe has one good example of a robot we do use regularly, but it seems like there are a lot more things where you could build them, but it would be hard to make them smart enough to be useful. And software is so amazing, right, because once you make it once, you can copy it and put it in everything. I sometimes wonder if a day will come when there will be robots all over the place doing useful things for us. Do you imagine that will happen, or are there breakthroughs that are kind of unforeseeable? What do you think about that? Brandon: I very much do, and the chain of reasoning you just followed is exactly what I did when I was in my graduate program. I'm looking around, thinking, yeah, hardware capabilities are pretty cool. There were some robots at the time out of Berkeley that had a crazy number of degrees of freedom, as big as a person, that could do all kinds of things. But the gap between what it could do if I was controlling it with a joystick and what it could do with its own brain was so big, and a joystick's not even a very good interface; what could it do if it was tied right into my brain? It's huge. And I realized that if you went into robotics, you typically focused on hardware or software, so it was like, well, software seems like the short-term bottleneck
here, so that's what I'm going to focus on. For now, a lot of the emphasis on machine learning methods is driven by performance on benchmarks. That's good: if you have to publish papers, you need some basis for saying this method is as good as, or better than, or close to some previous method, and the benchmarks are good for that. But it's gotten to the point, in my opinion, where the tail is kind of wagging the dog, and we only pursue the problems that we have a good benchmark for. So, image classification: in all the world of machine learning problems, image classification is a tiny, tiny little piece of what you could solve, but you wouldn't know that based on the popular press, or based on a random sampling of papers at NeurIPS, for instance. That is starting to change in the last couple of years; you see a little bit more of people going rogue with architectures, or with the problems that they're willing to handle, and I think more people are losing a little bit of patience with, okay, image classification, facial recognition systems, that flavor of image classification, we can do all this pretty well, but we are really bending our universe around this one point, so why not branch out a little bit? As we become willing to cover a little bit more of the space of problems you have to solve, we'll get robots who, like the Roomba, you might not watch and think, man, that's vacuuming more efficiently than I would, but watching it, you think, okay, it's getting to all the corners, all the edges, covering everything in its own time, great, I can go off and have a coffee and be confident that it's going to do its job. I think we're going to get more and more of that. Lukas: Cool, that's so cool. And I've got to say, I think the biggest pleasure of doing an ML tools company like we're doing is getting to talk to people like you, who are actually taking the technology of machine learning into these applications where you really see them, and it's so cool to see machine learning going everywhere, so I share your excitement. I feel like your answer to 'why robots' is so cool, we should make it into a video and everyone will... Well, I remember I watched Return of the Jedi in the theater, and I think I might be two or three years younger; it left me catatonic and I had nightmares for years. Apparently Han getting frozen in carbonite was not a traumatic thing for you. So, what do you think is one underrated aspect of machine learning that you think people should pay more attention to? Brandon: The one that I'm really excited about: we have image classification, and some really cool things with word prediction are happening, but the thing that's ripe for the plucking is unsupervised methods, being able to automatically do clustering, automatically learn the similarities between things that may have many variables, that might be really complex, but to be able to say, 'I've never seen this situation before, but it's kind of like this one I saw in the past,' and to be able to make use of what we've seen before. I think there's a lot of work that could be done there for a modest amount of effort, and it suffers mostly from the fact that there's no one right answer, so it doesn't lend itself to benchmarks. But if anyone hearing this wants to say, 'screw benchmarks, I'm going to go work on unsupervised learning,' I expect it would be a really fruitful way to spend your time.
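The "it's kind of like this one I saw in the past" idea maps naturally onto nearest-neighbor lookups over some learned or hand-built representation. A toy sketch with made-up feature vectors:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up feature vectors for situations the robot has already experienced.
past_situations = np.random.rand(500, 16)
labels = [f"situation_{i}" for i in range(len(past_situations))]

index = NearestNeighbors(n_neighbors=3).fit(past_situations)

# A brand-new situation: which remembered situations is it most like?
new_situation = np.random.rand(1, 16)
distances, neighbors = index.kneighbors(new_situation)
for d, i in zip(distances[0], neighbors[0]):
    print(f"{labels[i]} (distance {d:.3f})")
```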
Lukas: Nice, I wanted it to be intriguing. All right, so the next question is: what is the biggest challenge of machine learning in the real world? Brandon: If I had to pick the one that is the biggest in terms of impact, it is misapplication. It is easy to treat it like a hammer and beat on anything with it, without regard to whether the hammer is the right tool for that. So we see places where models are trained on a grab bag of data about people, which can transfer biases and transfer historical injustices, because those are the processes that generated the data, and the new model will just blithely perpetuate that. There are few people who are intellectually in a position to see how that works; some of them are wonderfully vocal, but still not all of them. I think the biggest downside, the biggest difficulty, is that those who don't know, or don't want to know, about that will continue to use it and perpetuate these things, and end up hurting people. Lukas: Do you have a particular example that specifically bothers you, or that you'd want to call out? Brandon: Facial recognition in law enforcement is one that comes to mind right away. It is demonstrably inaccurate, and especially for non-white minorities the accuracy is even worse than average, so it's just a way to cause many more problems than it solves. On the surface, especially when it's sold the right way, it appears to be a useful tool, and you can make lots of great claims about it, but that washes over the harms that it does. And that's not even touching facial recognition used for overtly discriminatory purposes, which is, you know, completely... Lukas: All right, well, finally, let's close with one final question: where can people find you if they want to keep this conversation going? What's the best way for them to reach you? Brandon: I'm online, fairly active on Twitter; the handle is underscore, b-r-o-h-r-e-r, underscore, so _brohrer_. I'm also on LinkedIn with regular posts, Brandon Rohrer. And a lot of my juiciest stuff, my labor of love, goes to the End-to-End Machine Learning school, some course materials I put online, and that's at 'e', the number 2, 'e', 'ml', dot school, so e2eml.school. Lukas: OK, cool, we can put all of these in the notes too, so people can find them easily. Fantastic, that was awesome, that was so fun, thank you so much. Brandon: Thanks, Lukas, I really enjoyed the conversation. I appreciate you and Lavanya setting it up. Lavanya: All right, that was such a great conversation. Thank you, Brandon and Lukas. I'm going to add a link to Brandon's Twitter account and also to his course in the show notes below; I highly recommend that you check it out. If you'd like to continue the conversation, we do have a very active Slack community with over a thousand machine learning engineers, and I'd love to see you there. I'll add a link to the community in the show notes below. Before we end for the day, I'd love to talk to you about something that I'm super excited about. At Weights & Biases, we love traditional machine learning as much as we love deep learning, so we built a scikit-learn integration that lets you track your model performance, compare different models, and pick the best one. We also help you run hyperparameter sweeps on your scikit-learn models, which lets you find the best iteration of your scikit-learn model. With one line of code, you're able to create really cool plots like ROC curves, precision-recall plots, learning curves, confusion matrices, calibration curves, and a lot of really interesting classification, regression, and clustering plots.
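If you want to try the scikit-learn integration mentioned above, a hedged sketch looks roughly like this; the project name is a placeholder and the exact plotting call signatures may differ across wandb versions, so check the current docs before relying on it.

```python
import wandb
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

wandb.init(project="sklearn-demo")  # hypothetical project name

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)

model = RandomForestClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)
y_probas = model.predict_proba(X_test)

# Logs ROC, precision-recall, confusion matrix, and related plots to the run.
wandb.sklearn.plot_classifier(
    model, X_train, X_test, y_train, y_test, y_pred, y_probas,
    labels=list(data.target_names), model_name="RandomForest",
    feature_names=data.feature_names,
)
wandb.finish()
```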
I'll add a link in the show notes below so you can get started right away. I would love to have you give it a try and tell us what you think; we're always trying to make our product better, and I would love to hear feedback from you. That's all for today. We'll see you next time for another great episode. Bye,5792
+Dave Rogenmoser & Saad Ansari on Growing & Maintaining Jasper AI,https://www.youtube.com/watch?v=J4_mO-MN5gI,4155,2023-02-16,"Dave: People just don’t have an understanding or a grasp of what is happening. Fundamentally, what are these models trying to do, and how do they respond to certain things? There’s just not anything anyone’s ever had experience with before. Coming in, we’re not just teaching them how to use our product, we’re trying to teach them, “Fundamentally, here’s even what AI-generated content means.” Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. This is an interview with Dave Rogenmoser, the CEO of Jasper AI, and Saad Ansari, the head of AI at Jasper AI. Jasper is one of the most exciting breakout successes in text generation right now, and a pioneer in using prompt engineering to build successful business. This is a really interesting interview both about entrepreneurship and applied machine learning, and also technical details around large language models, and the future of how prompt engineering will work. I learned a lot from this interview and I hope you enjoy it. Well, why don't we start a little differently. I was thinking this is what I would need as a ML researcher, which is mostly our audience. Could you explain how a marketer would use Jasper and what they would get out of it? And maybe even get concrete about what people love so much about it? Because you mentioned that people have a real palpable excitement about using the product. Why is that happening? Dave: Yeah, marketers have a lot of content to create, and most of them would create infinite amounts of it. Nobody ever has enough blog posts, nobody has enough...I mean, at some level you probably have enough to add creative. But, you run a bunch of tests, and all of a sudden a week later everybody sits down and they write all this stuff. They test it, and a week later they're out of things to test, and they go six months without ever testing another headline again for their Facebook ad. The only way to do this has been just through manpower, and just trying to hire more people, and dedicate more time to it. With marketing, it's such a thing that a little bit better headline can be the difference between successfully and profitably spending $100 million on ads, or spending $3 million and having to shut the whole thing down. A lot of this is pretty thin margins between a whole campaign working really well and it never getting off the ground at all. Yeah, there's just so much at stake and so much value to add there. Marketers think highly of themselves, like, ""My writing's different,"" and all of those things. And it is, for a lot of them, but I think when they saw Jasper, the fact that it was pretty good — in some cases better than them — and it could do it in an instant...it freed them up to go from the marketer that has to stare at the blank page and do it themselves, to now a little bit more of the managing editor.
Like, ""Jasper will give me all the raw materials here, and he'll give me the first draft, and I can pick and choose, and assemble, and all that stuff."" It kind of moves everyone up a level there. So, that's what marketers use Jasper for. Some of it's just high-volume stuff, they just need to create a lot of it. Some of it is creating great content that's made better than you would already create. All of it is marketers trying to just get an edge, and just get out a little better content, more content faster. Lukas: You were one of the first companies to really make commercial use of large language models. Could you talk about what that process looks like? How do you make the large language model into something that people are actually willing to pay real money for? Dave: I think the first thing is you've got to know a customer base, and know it really deeply. That's, I think, always been our secret. I just know the customer base super deeply. I am the customer, I've sold a lot of stuff to them before, they're my friends, they're our community. I'm so in it and I'm always looking for ways to make their lives easier, make my own life easier. I think a lot of people just aren't connected to any sort of end user in any meaningful way. I remember, I joined the OpenAI Slack community back in the day, and the day I got the credentials, I was like, ""I am Thanos here."" I'm like, ""This is ultimate power."" And I get in there, and I'm the only one talking about building something that people would want. Literally the only one. There's 1,000 people in there, and they're all translating the Declaration of Independence into Elvish, and then back out into album art. I'm just like, ""This is cool, but literally no one's talking about letting regular people use this stuff?"" It was confusing to me, still maybe perhaps confusing to me. I think there's just a huge market for taking this stuff and making it useful and solving some specific problem in some way. But I think, that's where it started for us. Just like, ""Hey, we made this tool."" I knew deeply what we wanted, and I felt like I knew the outputs that would get customers excited and that would get me excited about. Early days, it was just playing around. I didn't know anything about prompting. I'd love to see my first prompts, they were probably just nothing. I just dove in for a week, and just started really crafting these things one-by-one, and messing around with it all. I'm sure Saad has replaced everything I've done at this point now. The early days, it was just a lot of testing out. I think everyone was learning what to do there. But really, I just kept coming back to the customer. I care about all the settings and all the prompting — all that is just a byproduct of, ""We've got this problem that needs solving. And if this tool can solve it, then let's try it."" Lukas: Is the problem somehow more specific than just, ""Write me some content on this topic""? How do you think about that? Again, I'm not a marketer, so walk me through what they're thinking. Or even, could you point me to a specific marketer that works with you, and tell me how they think about the content that they get? Dave: Let's use blog posts, probably our most popular use case. You search for something on Google, there's only going to be 10 results that pop up on the first page. So, your blog post that you write has to be better than quite literally millions of blog posts that could surface and be somewhat there. 
You got to be in the top 10, but then even past that, you really got to be the top 2. There is a pretty big fall off even as you scroll past the top 2 or 3 there. It's not just about getting a blog post out — which is maybe what earlier models could do — it's about getting a really good blog post out. And it's really got to win in this marketplace of ideas. It's got to be compelling, it's got to be engaging, it's got to be factual, it's got to be helpful, it's got to be written by somebody that deeply knows the audience. Just clicking ""compose"" on some large language model, it's not going to be enough. It's not going to be enough to win, particularly if other people have that. Our customers, they want really high-quality content. I always challenge them, I'm like, ""Don't even bother writing low-quality content. This isn't some article spinner."" One, I just don't feel good about building that kind of product. But two, it's just not going to work for marketers. It's just not going to get clicks, it's not going to rank on Google, it's not going to get people excited when they come to your landing page, if it's just filler stuff. So, from the beginning we've always said, ""Hey, we're here for high-quality content."" To the degree that we can help people produce that, we will. That's going to be a big part of our focus, as opposed to just an article spinner that just spins out tons and tons of stuff. It's just not going to stand up and actually produce the ROI that marketers want. Lukas: My experience of just working with GPT-3 is, it's an impressive product for sure, but I don't think I get what I would consider high-quality blog posts out of it when I just mess around with it. Can you talk about how you actually got it to deliver high-quality content? Is there a human in the loop here that's tweaking it, or what's the process like? Dave: Yeah. Well, you certainly — even in Jasper — can't just go and click a button and get a high-quality blog post out now. We really talk a lot about it with our customers, like, ""Hey, it's a dance. Jasper's there to help give you..."" I might start with blog post titles or blog-post topics. ""Hey Jasper, give me 10 blog post ideas around this topic. Okay, that's a pretty good one."" And that helps me start off in a better spot. ""Jasper, give me 10 titles about that topic. Okay, cool, that's pretty cool. Jasper, give me an outline that I can start to work off of here. That first one stinks. Give me three more outlines."" You're basically going back and forth with Jasper to help assemble it. If you don't know what a good blog post is, you're going to be in trouble. If you don't really know what your reader wants, you're going to be in trouble. Because Jasper's not going to know all that. But what I think Jasper does do a great job of is, if you're able to help piece that together, you can assemble a really great blog post. Some of it'll be used, some of it'll be steering the output there, but it's just using a variety of different tools to do that. You can get some really, really, really high-quality content that's really remarkable and that readers really want, but it is going to take a human in the loop doing that. So, that's happening there. And then our team in the background is testing out all different models that produce blog posts best. There's all different prompts that produce different types of blog posts. Should we have one tool that creates a general blog post? 
Or are there actually five types of blog posts and we need five different models, each one a little bit more specific to a listicle, or informational blog post, or whatever? We're trying to do all of that behind the scenes and simplify that for the user, and just turn it into this magical experience that they can just show up and start getting to work. Our goal is that the software will become invisible. Lukas: You were the first person doing the prompt engineering, is that right, on day one of the product? Dave: Yeah, it was me. It was me. Lukas: What did you learn? Teach me how to do prompt engineering. What are the first couple things that you figured out when you were just messing around with it? Dave: Oh man. I mean, I just didn't know anything. First, I tried treating it like these instruct models where it's like, ""Write a blog post for me."" And it's just like, ""Write a blog post for me, write a blog post for me, write a blog post for me, write a blog post for me."" Okay, cool. What's happening here? There's patterns, and it's trying to figure out what I want, and all of that. I think really early days — and we still get a lot of benefit from this — is the examples that we would give it, for few-shot outputs, really were important. I felt like a lot of our competitors...we're marketers, so we're just probably sticking — again, I don't really know what any of them are doing — but we're just sticking whatever decent examples in there. And I had a really high bar. I was able to use examples that I knew for a fact converted really well on Facebook, I knew for a fact performed really well there. We just always used stuff that was proven in the market to start to steer and give examples there. So we'd get really, I'd say, highly opinionated, but really high quality outputs out of it there. But yeah, it was just me reading every doc I could. There weren't that many docs back then. And just talking to everybody and like, ""What's top-p mean?"", and all of these things. I just had no idea. But I knew the output I was trying to get to, and I wouldn't stop until it was like, ""Man, that is really good and I would really use that. I think that's probably what's harder, I suppose, than figuring out what top-p really does. Lukas: What does top-p do, what is that? Dave: Oh gosh, I was just hoping you weren't going to ask me. We got people here now, we got Saad that can do all that stuff. But if you're listening, let my ignorance on top-p go to show you where I think the real value lies here. It's outside...or, in addition to knowing what all the little things do, it's like, ""What is it useful for? And where does this really play a role in society?"" Lukas: Well, when you think about hiring a prompt engineer, what do you look for? Is it domain expertise in marketing and content? What else would you ask me if I showed up interviewing for being a prompt engineer? Dave: For us, I look first and foremost for an understanding of the customer. Or at least a willingness and desire to understand the customer, and an empathy for the customer. I really do want...we have people that apply that really want to do AI. ""I really want to do AI. I want to be at a cutting-edge company. Generative AI is so cool,"" and all that. That's fine, but if it's just that, I think you're going to struggle here. I want it to be, ""I really want to use AI. I love the customer base. I can see the problem, I can articulate the problem that we're working on here. 
And man, AI happens to be a really great fit there, and I'm excited to find what else really helps solve that problem."" That's really what I'm looking for, as opposed to this, ""I just want to do AI."" When I think about a prompt engineer, we got away for a long time without anyone that really knew AI, I would say. We were technical, and we could hack it, and we were fine-tuning models, and we were still doing some fancy stuff, but it wasn't like we were doing anything that would blow anyone's minds. But it still just worked. Going back to that, I think generating prompts and working on that — starting with a deep understanding of the customer — will get you so, so far. Lukas: I think you're the poster child for prompt engineering. And certainly some people think that machine learning goes away — or takes a backseat in the sense of training models — and there's this new role of prompt engineer that uses these models for some purpose. My background's in machine learning, but I'm open-minded to different paths that the industry could take. Do you think that in your world, machine learning, technical ability even matters at all? Do you try to hire machine learning people themselves? Dave: I think it does matter. I think a lot of this stuff — over a long enough time horizon — gets commoditized, and it matters...I don't know if ""less"". The things that used to matter, technically, are probably much more solved now. We've got a really strong internal AI team that's full of really smart, what I would call, ""AI people"". Saad's putting together an awesome team there. We want to have a ton of those people that can just help us, can develop moats and develop IP, and again, solve customers' problems in a deep way. I think it does matter, I think we'll always have that. I want to have a 500-person team full of ML engineers and AI people doing all of that. But I don't want to be driven by that. That's always a byproduct of the thing that we're trying to do. If we can solve customers' problems using all that stuff, then we're so much better off for it and it definitely gives us an advantage. Saad, do you have any thoughts there you want to add in? Saad: No, I totally agree. Essentially what we're asking is, the customer wants something and the customer has a vision or view, or maybe they're trying to discover a new idea. There's this ideal output out there somewhere that would make them happy, and delight them, and give them a lot of value. Between their input and then the ideal output for them, there's choosing the AI system. It's not necessarily one model, it could be a number of models. I think that's where the AI team plays a role. Like, ""What is the right system? What is the right base?"" I think you're right as well, prompt engineering's going to play a huge role. As Dave said...Dave's a perfect prompt engineer, and is somebody who loves the customer, and is willing to iterate through n-number of cycles to find the right prompt. Just an interesting point there as well, there's this idea of expertise. An expert is somebody who's learned something and knows it from experience. I think one of the really interesting and fun things about a lot of these models is that even the people who made them aren't experts on them. It often happens, our R&D center will come out and give us a model, and we have to test it, and they'll tell us all these things about the model. Within a few minutes of us testing it, we've already falsified a lot of those assumptions. Nobody's an expert in prompt engineering. 
It just takes a love of the end use, and customer, and the product, and then just willing to be patient with it. Lukas: I guess one of the things that seems like it might be hard — putting myself in your shoes — is actually quantifying if your models are improving over time. Is there some way that you know even, Dave, that as you iterate, that they're better, besides just eyeballing the content that's getting produced? Dave: Yeah, early days it was eyeballing. Sometimes we'd — in Slack — just try to pop in two screenshots, ""Hey everybody, vote on one of these. Which one do you think is better?"" But again, all this would happen in my head a lot, where I would just keep cycling until I could feel it getting better. Just my own expertise there, it was nothing scientific. And even a lot of stuff we didn't test. Once I found the right setting on something, I probably wouldn't even test around and find the optimal one, it was just locally pretty good. Then I would release it to customers, and I think anecdotally they'd share feedback whether stuff was getting better or worse. We would track things, ""Are they favoriting it? Are they copying it to their clipboard?"" That was a signal that was even maybe stronger than favoriting it. And then, are they using it, other places in the product? We started to track those as real signals there. It is funny, especially early days — this probably happens now — but a lot of people's perception can really sway a whole community. We'd have somebody complain about a template, and they'd say, ""Something changed in the last five hours, it's totally different, it's way worse. Please revert it. Dave, this is getting worse."" You had a bunch, ""Yeah, set the change unchanged [?]."" To the best of my knowledge, nothing had changed, nobody was touching that, nothing shipped, nothing happened. But people just pile on that as a way to highlight their frustrations with it. And the same thing reversed, people would be like, ""Hey, anybody notice that the paragraph generator's way better now?"" Again, nothing had happened since before...all those people piling on like, ""It's so much better, I love it. Thank you for all you guys do."" I think companies can get a lot of mileage out of frequent improvements and having a culture of improving, because everyone just always assumes everything is improving all over the place. You get the benefit of the doubt over things you haven't even touched. Like, ""Man, everything's just getting so much better all the time,"" because they've come to see that that's a general theme in our company. If we slow down and stop, again, I think it atrophies. The customer trust corrodes, and they start thinking more stuff is worse than it really is there. It is hard to quantify, but a lot of it is just customers sometimes just feeling like things are getting better, and they're being heard, and they're seeing improvements there. They'll give you a ton of benefit of the doubt from that. Lukas: Do you feel like you benefit from improvements in GPT-3? I've heard different things from other people. Some people seem to feel like GPT-3 will launch a new thing and it'll break all the prompts. Other people tell me that it's actually much better than it used to be. What's your experience? Dave: Yeah, I think we get benefits. 
Even I tell our team, ""Let's do a lot of our own stuff and have our own IP, but if OpenAI is going to just do all this free work, and then just push it to some API endpoint, and then now you've got all this new functionality that takes us 20 minutes to test it and implement it, let's use that."" Let's always be sure that we're testing all this new stuff — that they've got 200 people building for us — and not just rely on our own stuff. It's hit-or-miss. Not everything that they roll out is better. We A/B test pretty much every new model and update that they make. We'll fine-tune our own models, and it's definitely hit-or-miss whether those are even better. You got some of the best people in the world working on this, and they'll be super excited about some model, we'll test it and be like, ""This is actually performing worse across the board. We're not going to roll this thing out there."" It is interesting how that all works, but we definitely try to use all the stuff that they release. Lukas: Do you find fine-tuning useful? Again, some people say that prompt engineering makes fine-tuning obsolete. What's your opinion right now? It's November 7th, 2022. Dave: Yeah, this will be obsolete November 8. Generally, we find it helpful. I think a lot of what I worry about in the space or try to allocate...it's allocating resources correctly here. Where you're not doing all of this work that then just gets obsolete out of the blue. It's like, ""Oh, we just spent all this time fine-tuning all these models. And oh, this new model makes fine-tuning obsolete."" That is so much of what we're trying to do, just figure out like, ""Where's our space, where's our special secret sauce that we can go and implement there?"" We've got different fine-tuned models running, but even as I say that, I'm thinking like, ""Man, I don't know when the last time we tested just a new model that might very well be better than an old fine-tuned model."" This stuff changes so fast. I think a lot of what we think through is just being sure we're always going back through the whole system, and updating with new stuff, and testing it with new stuff. Lukas: Another question that's probably not going to age well — but I'm curious your current take, as far as you can tell me — is, how do you think about the different LLMs out there? You're famously using GPT-3, but I'm sure you've tried BLOOM and maybe other ones. What do you think about all the LLMs out there? Dave: We use GPT-3 primarily but not exclusively. We've had other stuff going on and we're always testing new stuff there. It's funny. I think I get down...this feels almost like an Android/Apple conversation, where you get down into the weeds, and people that really are in the know are like, ""Oh yeah, GPT-3, not even top 5 anymore."" I've heard people say stuff like that, ""No, this one's better, this one's better, this one's better."" I just don't see that bubbling up to being a more user-friendly model or really doing things. I still feel like GPT-3, from my perspective, is generally far better than a lot that's out there. I don't doubt that. We've seen this ourselves too, there's things that can do specific things better. But I think by and large, GPT-3 still reigns supreme in my mind, and most people producing high-quality content are using that primarily. That's what I'd say. I don't know, Saad, what do you think? Or even Lukas, what are your thoughts? Have you seen that...I'm always asking people this question, ""What else is out there? 
What are you seeing?"" I think a lot of people that just [?] kinda really know, they're like, ""GPT-3's pretty dang good."" Saad: Yeah, this is almost like a puzzle of three black boxes. You have the black box of, ""What does the customer want?"" The black box of, ""What do the models do?"" And then the black box of everything in the middle. And like, what are we going to do about all that stuff? I think customers want different things for different use cases. For blog posts, maybe they want to have something that's more semantically complex, the language is richer. If they're doing product descriptions, maybe they want to make sure the facts are preserved, and is more domain-specific, and that is able to speak to and sustain the data and the specifications that was in their initial product descriptions. What we're finding is different models have trade-offs. Typically, when you increase in semantic complexity, the same processes which get you that also get you to break down facts. It gets better at representing what looks like facts, but it might actually be lying way better. And so you can't even tell it's representing facts in a false way. Some of these models are a little bit more literal, but they're not able to be very semantically complex or speak like Shakespeare. GPT-3 is great, but it wasn't trained on a lot of foreign languages, whereas BLOOM was. I think ultimately, it's about, ""What does the customer want for that specific use case?"", and then, ""What is the best way to get there? What are the approaches and processes?"" It's not necessarily one model all the time. Maybe it can be a combination of models, like an adapter model with a base model. And then how do you tie together those initial models? It's almost like a menu of options to get the best output for the highest efficiency. I think it's more of puzzle pieces, you'll always have these three variables you're dealing with. Lukas: Well, sure, but at this moment is there another model that you generally use? Or is there a number two that you would point to, other than GPT-3? Saad: T5's really interesting for some of its instruct capabilities and its ability to be fine-tuned for very specific things. And indeed, you could say there's a hypothesis that a really good AI architecture will always have two things. A really generalized model that's powerful in probably semantics, which is one half of the thing about language. And then a secondary model is either really good at instructional, like specific instruct, or at some sort of fact adaptation. Because one model will become slightly more complex at the cost of sustaining facts, whereas another model can preserve facts at the cost of not being the most semantically complex one. This is a hypothesis. We think we'll probably end up finding a lot of these different pairs to get the best of both worlds. Lukas: Interesting. Maybe overall — at this moment, November 7th, 2022 — do you have a characterization in your mind of what large language models can do and can't do? Where do you feel the limit is? Is there a type of content that you feel like you couldn't create well — given the current state of things — that you might be able to do in 2023 or 2024? Dave: I can bubble up a few big customer complaints. Perhaps the biggest one is factuality. It'd be one thing if they just always had incorrect facts, and then you see a fact and you go correct it, but it lulls you to sleep. 
Because it's like ""Man, it knows a lot."" All of a sudden you start trusting everything, and then you don't want to look up every fact, because we've seen four that are totally right, but then it'll just say the opposite. I remember one time I was asking who won the 2021 Super Bowl. I forget what it was. Basically it was the right teams, the right score, the right location, the right date — I looked all that up — but it actually switched the teams. The team that lost, it said won. It just lulls you to sleep, because it all looks pretty good and probably passed the sniff test. And then you realize, ""Oh man, I just shipped the exact wrong answer."" That's just a big thing that we've got to figure out how to control for, how to identify. How to be just a bit more truthful, I suppose. Another one is just, obviously, getting it to follow instructions. It tends to repeat itself. If it hasn't quite picked up the pattern or the instruction, it just thinks you want it to say the same thing over and over again. If our customers set that pattern — let's say you do it twice — and then you keep trying to write...now it's going to do it even more, you just reinforced this. It spirals outwards, where it's like, ""Well, why is it always saying the same thing over and over?"" It's because you set the pattern at the beginning that makes it do that. Or you misspell one word in an intro paragraph of a blog post, and then you realize that Jasper misspells it the entire way. You're just like, ""What is happening here? It's a common word."" It's like, ""Well, you kind of misspelled it, and so Jasper thinks that's how you want the word to be spelled."" There's definitely a lot of just steering content and teaching people. I also think people just don't have an understanding or a grasp of what is happening. Fundamentally, what are these models trying to do, and how do they respond to certain things? There's just not anything anyone's ever had experience with before. Coming in, we're not just teaching them how to use our product, we're trying to teach them, ""Fundamentally, here's even what AI-generated content means. Here's the limitations of these kinds of models, and here's what they're really good at."" We're trying to teach all of them that in a very simple way. Saad, have you seen anything that you feel like we can't do? Saad: Yeah. Just the way we evaluate and think about models is, you have a X/Y axis, you have semantic complexity, and then you have domain fit, which has a lot of different features like factuality. Then you have a bunch of additional capabilities like multilingualism and so on. We also pay a lot of attention to instruct, can the customers get what they want out of it? In terms of semantic complexity, I think it can probably end up doing everything, at the end of the day. There was this article about ""Does Moore's law apply to generative AI?"" I think it does, but not really. It applies for semantic complexity really well, but unlike Moore's law, which continues on forever, there's diminishing returns. Eventually you'll start hitting an asymptote and level off, because what does it mean to become infinitely good at semantics or language? Humans have a limit, you can't really beat that. What would it even mean, for a customer to go above that? I think it'll actually get really good for semantics. I think there's a lot of limit around domain fit and factuality. There's actually an aspect of it that worries me a little bit. 
I would really be worried if anybody using a generative model started using it to get advice, legal advice or medical advice. It speaks to what Dave was saying about factuality. It's actually getting better at lying. It looks like it's more factual. Like if you ask it for legal advice, some of these models can cite legal papers and even come up with fake court cases, but it's totally made up. It's actually not just a limitation, but almost a risk. I think our community is really wise to not use it for that, but you'll see this get better and worse at the same time, it'll get better at representing citations in a really strange way. I think for domain fit and factuality, we actually have the perfect tool for factuality. We've always had it for decades, it's copy and paste. The question is, if we want to increase factuality, are we able to bring in a database that has the facts that the user cares about? And then have those stupid models... I call it stupid factual...you have stupid factuality, smart factuality, and false factuality. Can we replicate stupid factuality with one model, and then have another model be semantically complex, and bring the best of both to the user? I think the limitations are around factuality, as Dave said. I think everything else, though...in a way, it's not that the sky's the limit, but the users are the limit. We'll be able to accomplish what a lot of humans can accomplish in [?] language and also semantics. Lukas: When you say ""semantic complexity"", can you give me an example of what you mean? What would be a very semantically complex thing to say? Saad: Yeah, let's do two examples. One being a Tweet and then one being the longest form possible. A Tweet, if you say a sentence like, ""The dog barked."" Or let's just say you say a sentence like, ""The dog barks."" And you say, ""The dog likes Jurassic Park."" And you say, ""The dog likes Jurassic Bark."" The last one is a pun. For the model to know, ""Hey, write me a funny joke about a dog,"" it would have to know that ""bark"" and ""dog"" are related, it would have to know that ""Jurassic Park"" was a movie and then you can replace ""park"" with ""bark"". There's a lot of semantic complexity going into that sentence. You're getting higher density of meaning within shorter tokens or word count. Whereas the first sentence is, it's almost the same word length, but it's less complex. Semantic complexity is the ability for it to have different layers of meaning within a given space. In terms of the longest form, think about a play or something by Lin-Manuel Miranda. You have questions of plot, where you're getting the end of the play to refer to something in the beginning of a play, or different paragraphs referring to each other. If you imagine the words being linked to each other, it's like you have more links between words and between paragraphs. That's semantic complexity, it's more dimensions to it. Insofar as these LLMs, large language models, are predicting the next word in a string of tokens, you can see why it's hard for them to accomplish this. But at the same time, why they mathematically can end up doing so. Lukas: It's interesting, I feel like I've spent a little more time with DALL·E maybe because my daughter loves DALL·E. I feel like there, we have such basic problems. We try to get it to draw the mom with black hair instead of blonde hair, and it drives my daughter nuts, and actually my wife nuts too. That just seems like such basic semantic understanding of a set. 
It'll often take a different person in the scene that we're trying to describe, and give them black hair instead of the mom. I'm curious, do you think there's something different about image generation? Because it doesn't seem like it has very much understanding of what I'm asking, at least in that domain. Dave: I think image generation is interesting. Obviously it being so visual and instant, it's really easy to synthesize the whole thing in half a second. Where I think if you had Jasper write a blog post, it's like, ""Is this a good blog post? Is it what I wanted?"" It's going to take me two minutes to figure that out, do all of that. There's something...I'm sure there's a lot of weird stuff happening. And obviously, text generation's been around longer than image generation. This image generation will probably be super easy and awesome in a year or 72 hours. I'm sure there's weird stuff happening that's just harder to see. It's harder to see that, ""Oh, it gave the wrong hair color to the wrong person, or it gave the wrong conclusion to this thing that I thought it did there."" It also seems like image generation, you could say, ""Don't paint this car pink."" What's it going to do? It's going to paint the car pink because it doesn't know that ""don't"" and ""pink"" are tied there. I think image generation prompting still feels much dumber than text generation. I assume it's just the state of the technology being earlier, as opposed to maybe something being more complex there, but I could be wrong. Lukas: Interesting. I'm curious — and I'm not here to grill you at all on your business model — but I feel like I have to ask. You made this awesome business, it sounds like, in a few weeks of effort at first, and it just took off. How do you think about defending your business? Don't you worry that someone might come along with a similar approach? Or maybe they find something that's a little bit better, somehow, in such a fast-moving space? How do you stay ahead of that? Dave: Well, I don't think we made an awesome business in a few weeks. I think we made a crappy MVP for how to do Facebook ads in a few weeks. And then I've spent every day since then building out all the other parts of a scalable, repeatable business. But no, it's a super valid question. I think I've spent a lot of time just thinking about moats over the last 18 months. What's real, what's not real. I'm looking at the B2B companies, like, ""Where are the real moats?"" Obviously you've got...people think moats, they think maybe network effects, or you think Uber going into a city, or you think Amazon having warehouses everywhere. Things are so structurally obvious, but then you also take maybe HubSpot or maybe Adobe, and it's like, ""What's the moat there?"" It's like, ""I don't know. They knew a customer, and they built a good team, and they had good culture, and they maybe got a little lucky and they kept executing over and over, and they had a second product, a third product, and a fourth product."" I think in B2B, that's probably far more common than this Amazon example. Where you just end up building a good company that can continue to execute at a fast pace, and knowing the customer deeply. I think you've got moats like brand, you've got moats like community, you've got distribution. We want to have all of those, but we also want to keep developing strong product and tech moats too. 
I think at some level that means we've got to have a continually improving product, and we've got...something where you end up having so much product built. Maybe none of it's hard to build, it just would be hard to build all of it. And by the time you built all of it, if you're a competitor, I'd be gone too. But I think when it comes to our AI, yeah, there's a ton of different differentiation, just around the models that you use. We want to be really nimble. We're always building in such a way that we can replace everything wholesale very, very quickly. I think a lot of companies maybe are going to get stuck on some old model or some old way of doing things, and that's going to be the death of them. We also realize that we've got a really unique dataset that our customers are giving us. We're seeing how they use it, and they're generating all sorts of content that nobody else in the world has. To the degree that we can use that to go and make models better, any model — OpenAI's, this new one that comes out, the new one that comes out tomorrow, whatever — we would be able to take that dataset and very, very quickly fine-tune and train those models to be great for our customers. It may not be great for anybody else's customers, it may not be great for any other use cases, but it's what our customers want, and we've got a good inside track there. All that being said, I think moats are something to be worked towards. I think there's a lot of pieces. I don't think Jasper, or almost any other company in B2B will live and die on one perfect moat. It'll be a combination of six, seven, eight different things that make it hard to do it all together. Lukas: From a technical perspective, are there things that you do to stay on top of the generative AI space broadly? Dave: I'd be curious to hear about you, Saad, but I see a lot of it on Twitter, to be honest. Where do you go for breaking information every 10 minutes? It's Twitter. By the time it makes it into some newsletter roundup, that stuff's obsolete now. A lot of it's just finding and curating a good Twitter list of people that are just in the know and all of that. Conversations with other founders like yourself or other stuff like that, I hear a ton there. It feels the only way to stay up to date is to really get all the way in, because no one's going to curate it and spoonfeed it to you. And by the time they do, you'll have missed it. What do you think, Saad? Saad: Yeah. Before I started at Jasper, I called up one of my mentors who was running a bunch of R&D laboratories and research processes. I was like, ""How do you become a successful R&D leader?"" He was like, ""Well, you're probably never going to beat everybody else in the world at everything, because you have the whole world and they're all researching and coming up with the best stuff."" He's like, ""Definitely stay on top of that, but make that a small percentage of your focus. Find the one thing that you can be the best in the world at."" As Dave said, we want to be the most customer-obsessed company. We want to understand what we can do with customer data. It'd be great if a customer could say like, ""I thought it and Jasper got it."" They go on from their idea to something that's in their hands — some content created — in the fastest, easiest way possible. I think that that customer data, being able to take the best of the world's R&D and saying, ""Hey, we're coming up with this new model that can be fine-tuned in this many ways. 
You have these new prompting techniques, you have these new base models or methods to hook on adapter models to get better outputs. If you want to take all of that..."" The one big thing we want to do is find a way to use our customer data to get more customer fit, I think. And that's a big deal. Like I said, models are either going to get semantically more complex or they're better at domain fit. I think that's almost the whole second axis, that we can be the best in the world at. Lukas: From a hiring perspective, do you actually try to hire experts on generative AI? Is that even possible? Would anyone pass that bar at this point? Saad: It just goes back to this word of what does expertise mean? The paper on Transformers came out in 2017. We've gotten tons of amazing applicants who have a lot of AI experience. I think that's the question. Is there an expert in generative AI? We're all learning these things together. Even the R&D centers — these world-class folks that come up with a model — they don't even know what the model can really do until they test it. I mean, it really is a black box. I don't think that this idea of super explainable AI can apply to this field super quickly. So, what does it mean to be an expert? I think what we're looking for is people who are obsessed with customers, who are fantastic problem solvers, who are creative and able to navigate this uniquely interdisciplinary space. You have to be really good at the AI and the data science. You also have to be really good at language and you have to love the customer. And it's a pretty rare mix. It's like the book by Walter Isaacson, ""Innovators"". The people who are good at the art, good at the science, and they have the customer obsession. I think those are probably the three right ingredients. We've been lucky to get some really great candidates along that line. I think it makes the space uniquely interesting. It's definitely not boring. Dave: We want to have a pretty diverse set of experiences there, because you just got to be tapped in broadly. I mean, I said this earlier, but I think it's worth saying again. I have never worked particularly well with people that come in an interview or something, say, ""I really want to use this technology. I really want to use Terraform."" Or whatever it is, I don't know, it's just never worked out. It tends to be, ""Hey, we got this hammer, we're just walking around looking for nails all the time,"" instead of just realizing, ""Oh my gosh, a screwdriver would've done it so much simpler, so much faster there."" We tend to shy away from folks that just want to do some cool technology. But if they get excited about that as a way to solve a problem, then that's huge. Saad: That's actually a super good point, Dave. We've been talking a lot about fine-tuning. When people imagine fine-tuning, they think about the most complicated things first. I don't know if this is a trade secret or something, but there's actually so many simple things you can do to get major uplift. That scrappiness, it goes a long way. Lukas: Wait, give me an example. You can't just say there's so many things. What's the first thing you would do? Saad: For example, even with prompting...right now, when you think of prompting, maybe you think about a user putting in a prompt. Or maybe you think about some backend prompting that a template has, and then a user interacts with that, and then it sends it off. I mean, just a simple thing too. 
Like, if you have a store of what you can call context — a series of pieces of information that the customer cares about. It could be their voice. It could be a list of their products. It could be their customer's voice. It could be an example of a customer review that somebody left them and was really good. It could be a speech that maybe one of the leaders gave — you could actually just concatenate various pieces of context, and then have a prompt, and then get a really cool... If you think of the generative model as a remixer, ""Remix these various pieces of context and give you something new,"" it actually works really well. It's nothing super fancy, you're not fine-tuning the model, you're just doing really clever prompting. I've been showing our business dev guy, our business pod leader, and he's been really impressed with it. It's just these little hacks, there's thousands of these things. Dave: And even to simplify it...this is Dave-level. Let's say I wanted to make our paragraph generator 10% better, maybe measured by...it gets copied to the clipboard 10% more often. There's obviously a bunch of ways to do that. One way: somebody that really wants to do AI would probably come to it and they'd say, ""Oh, we got this T5 thing and we're going to spin up our own infrastructure, and all of this stuff."" Probably take a month and a half, but we'll get this thing and we'll fine-tune it on our past customer data. Yada, yada, you do all that, and whatever, it could probably work. You could also...for us, we've got it built in a way that non-technical people can do it. I could go adjust the temperature from 0.7 to 0.4, and we might find that, ""Holy cow, that actually produces way better paragraphs and nobody had ever even thought to test it."" And that takes six minutes. Now the customer gets the 10% improvement either way. Do they care about the T5 version, think that's so cool, and so amazing, and awesome? No, they're just like, ""10%? I'll just take whatever one you give me, I'm trying to write this blog post so I can get home to my kid's baseball game."" We're always pushing ourselves to be like, ""What we care about is the lift. What are all the ways we could do that? Let's start with the simplest one first, as opposed to just playing startup or playing AI, or doing just whatever new cool white paper came out yesterday."" Saad: Yeah. And just to refine that even a little bit more, it's actually really surprising sometimes too. We did this experiment where we actually, we had two models and we actually thought...let's just say model A and model B. We thought model A was going to win because it was better, and more semantically complex, and all that type of stuff, but our customers liked model B better. We thought about why. B was a little bit more wordy, more flowery. If you're an English teacher, you wouldn't have liked that, you'd like model A, but our customers really liked model B way better. It dawned upon us that the customers are using the content outputs much like a sculptor looks at a big rock. They're actually trying to get something that's easy for them to delete from, rather than something perfect that they want to add to. I think the models are complicated, it's really interesting, but the customer is also very interesting, and we don't fully understand exactly what they want and what they like either. Being able to focus on that using these hacks is the way to just understand the customer better, faster too. 
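(To make the ""remixer"" idea above concrete, here is a minimal sketch of the context-concatenation prompting Saad describes, plus the kind of temperature tweak Dave mentions. It is illustrative only: the context store, helper function, and model name are assumptions, not Jasper's actual implementation, and it assumes the legacy OpenAI Python completions API.)

```python
# Illustrative sketch only — not Jasper's actual pipeline.
# Assumes the legacy (pre-1.0) openai Python library and a text-completion model.
import openai

# Hypothetical "context store": pieces of information this customer cares about.
context_store = {
    "brand_voice": "Friendly, direct, no jargon.",
    "product_list": "AI copywriting assistant for blogs, ads, and emails.",
    "sample_review": "\"This tool cut our blog drafting time in half.\"",
}

def build_prompt(task, context):
    """Concatenate the customer's context pieces ahead of the task instruction."""
    context_block = "\n".join(f"- {name}: {text}" for name, text in context.items())
    return (
        "Use the following background when writing:\n"
        f"{context_block}\n\n"
        f"Task: {task}\n"
    )

prompt = build_prompt(
    "Write a short intro paragraph for a blog post about AI copywriting.",
    context_store,
)

# Temperature is one of the cheap knobs to test (e.g., trying 0.4 instead of 0.7).
response = openai.Completion.create(
    model="text-davinci-003",  # assumed model name, for illustration only
    prompt=prompt,
    temperature=0.4,
    max_tokens=200,
)
print(response["choices"][0]["text"])
```

The point is that the model is never fine-tuned here; the lift comes entirely from what gets concatenated into the prompt and from cheap sampling knobs like temperature.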
Lukas: When you think about your R&D budget today, is it more 90% prompt engineering and 10% machine learning and fine-tuning and this fancy stuff? Or is it more 90% fancy stuff and 10% prompt engineering? How would you describe where you put your investments right now? Saad: I do view them tied together. For us, it hasn't been such a big divide. We'll get the new model or we'll come up with new adapters for a situation, and we then have a black box. The question is, ""Will this black box be better for our customers, the same, or worse?"" We definitely put it into an A/B experimentation situation and we start running it through numbers of tasks. These tests can involve different prompts, it can involve different configurations. We have a bunch of internal metrics we run against too. Everything really just represents our hypotheses where we think the customer will like more. Will they like more varied sentence structures, longer stuff? Will they like something that's more on topic, or so on? I think it's all a part of the toolkit, and the right percentage is the one that results in the biggest uplift fastest. I'm sure it'll always be moving around. Prompt engineering plays a huge role right now, but we know that a lot of customers have asked for more domain specificity. So, that's an area of research where a lot of our R&D budget is going to, but once you overcome this initial hump, then maybe we'll go back to prompt engineering again. Dave: Yeah, it's probably less prompt engineering, as a percentage, than you'd probably think. Less than 50%, maybe a lot less. Yeah, I don't know. I think there's definitely some diminishing returns, where you play around with that stuff, and you get some great gains, and you keep trying, and you can't really get anything else to really breakthrough there. [?] things here get you pretty far pretty fast, but at some point you've got to do a lot of the outside stuff to just keep getting big improvements. Saad: Yeah. Lukas: Is it hard, as you scale, to institutionalize the things that you're learning? I picture your company running all these different experiments, do all these different things to make customers happy, but does every new hire have to come in and learn on their own all these things? How do you keep track of all these insights that you're having? Dave: That is a challenge. So much of what I try to do all day...just for context, we've gone from 10 people at the beginning of the year to 150 people now. Lukas: Wow, that's incredible. Dave: It's a ton of just...you're trying to find somebody that knows what happened three months ago. You ask 10 people and you can't find anyone that was even here. I think that's been a lot of the work, just trying to give people context over and over. Try to point people to past things that were done or past experiments that worked. Luckily, I mean, Slack is a pretty good record for a lot of that. We've got this channel that's a ""shipped"" channel. Anytime something gets shipped to customers, it would go in there. You learn a lot just by scrolling back through that and seeing all the different winning A/B tests, we always published that. We'll even try to write that in a way that's customer-facing, just to fully wrap your head around like, ""What are we trying to do here?"" Put it in a way that the customer would appreciate. Don't just talk about latency, talk about like, ""Man, now our customers can generate 18% faster,"" and yada, yada, yada. I think a lot of it's just [?] 
call people back to the past, what we've done, and aggregate that in the best ways that we can. But we have not found a really easy way...we don't have a super cool training course on all the insights that we've had. I think we're getting better at trying to aggregate those and make sure the right people have them. Saad: I know how you feel about this, Dave. I think remoteness has a lot of benefits and has a lot of challenges as well. I think a lot of folks pre-COVID are used to just getting into a team room, and having a sprint, and you learn through that sprint. It's just different in a remote setting. I think the world is just getting used to how you apprentice in a remote setting. We definitely try to simulate that with offsites and then getting to meet the team. I'm not saying it's a challenge, but it's definitely something we're learning about as we go. Dave: Yeah. You guys fully remote, Lukas? Lukas: We're basically fully remote. We do have a headquarters in San Francisco, but our meetings are generally remote-first, and we'll hire people in any geography. Dave: Yeah, I definitely think going from 10 to 150 would be easier in-person, but that doesn't mean that the fullness of the company...I think that there's outsized benefits — over a long enough time horizon — to being remote. But it definitely feels like this early forming phase, trying to get knowledge in the right spot. It's tough remote, I think. The initial team was all in the same room. I think it'd be very hard to find product market fit and have the year we did out of the gate, if we were all just remote at that time. Lukas: Yeah, on my end I've appreciated some of the discipline that remote work forces you to do. I think we write a lot more down and you keep better agendas, and records, and things like that. For me it hasn't been all bad, but I think there's so many different ways to run a company. And I think different teams even, some prefer to do lots of onsites and some don't care at all. Dave: I do love that it forces you to just think more clearly, communicate more clearly, plan ahead so you're not just always putting out fires throughout the day. I really think it does do some really good stuff there. Lukas: I'm curious...we don't intentionally try to invite only customers on the podcast but I think in the end we typically are talking to people that are customers, usually. And you guys actually aren't a customer of Weights & Biases. I wonder, if you were in my shoes, would you be worried? Do you feel like there's this big trend happening that undercuts what Weights & Biases does? I think, Saad, you're probably more familiar with what we do, so maybe I'll let you answer that question. And I promise I won't be offended with any direction you want to take that. Saad: First of all, Weights & Biases is a great company, a lot of friends are there now. And just thank you for inviting us, even though we're not customers. Correct me if I'm wrong, but I feel like Weights & Biases, it increases in value as the customer has more and more models. It is essentially a thing that scales in value as the number of models scales, is that right? Lukas: I think that's what customers tell us. Yeah, for sure. Saad: The jury's still out in terms of how this generative AI space will shape up. But I can see some companies developing in this space that are mono-model companies. They just have their one model and they're specialized in that one model and its use cases. So obviously for that, that'd be a challenge. 
I could see other companies, though, that have maybe a few big base models. These things are pretty huge, you don't want to be fine-tuning a 100-billion parameter model all the time unnecessarily. Maybe these companies have two, three, four bigger models, but they have tons of adapter models or they have some small models just for different things. I can see Weights & Biases being ultra useful for that. I think overall the answer to your question is no, it'll still continue to be really useful. I think how people think about scaling models and when and how is it viable, that might change. I think we're still learning what the best architectures are that'll sustain in the space. Lukas: Okay. Well, usually with the nerds that we have on this show, we end with two questions. I might slightly modify it for you guys because I have a slightly different version that I want to ask you. Typically, we ask if you had more time to research a different topic in machine learning, what would that be? But I want to ask you all, if you had a different domain that you think these models might apply phenomenally well to — where there's no one like you who's come in with that customer empathy — what do you think is ripe for disruption, with these generative models? Dave: I think about this a lot, and I think we've seen...we just did our big Series A announcement, there's probably just an army of clones gaining strength in the corners of the universe right now that will all pop up in the next two months. They'll be just very much like Jasper, and even probably good products and all that. That's a bad way to do it. You don't want to compete against us at our game. We have to be really good at our game, but we have one of a million games. What I would encourage people to do is...this could be done anywhere, and you could take this and put it in any subset of any industry. You could do legal stuff, you could do stuff for doctors, you could do stuff for different teams and companies. If you think about CRMs now, CRMs over 20 years...I've got a friend that just started a cleaning company, a local cleaning company, and he was like, ""It was so easy to start, there's this CRM that just does everything for you."" I was like, ""Well, was it HubSpot or Salesforce?"" He's like, ""No."" It's some rando thing I've never heard of in my life, that's a big company that's just a CRM for cleaning companies. That's where this goes to. You take your end user, you understand them deeply, you take all the noise, you simplify it for them, and you give them a product that just does what they're trying to do, better. Anyone could beat Jasper by going deeper in any little vertical that we're not fully focused on there. Anyone could do better at something that's a little bit more specific. This is true, I guess, of models too. You could get a better model if it was just more specific there. That'd be my encouragement to people, maybe just the community at large. This is all so cool, but there's so many people that would be thrilled to use this technology, if you would just package it up for them. I think the key is to not try and be the next Jasper or do exactly what we do. It's like, take the essence of it and then go for a different customer segment. There's so much opportunity out there right now that's just completely untouched and nobody is trying to build for. You go find a community and do that, you'll be in really good shape. 
Lukas: But do you mean making marketing copy for lawyers, or do you mean doing some specific other thing for lawyers when you say that, just as an example? Dave: I mean, it could be any. I don't think marketing copy as much. I'm thinking...well, one, I would just want to talk to lawyers, like, ""Hey, can you tell me about what you write all day?"" That's where I would start, since I'm not a lawyer. But I'm guessing it's a lot of explaining in short form to customers over email what that document means. You can just build a quick summarizer that just hooks into Gmail and spits that out really fast. It could be you training up paralegals to understand more stuff, so you build a little tool that helps them synthesize documents, and explain it to just train up people internally. It could be generating some boiler-plate content. Maybe I talked to a lawyer and I say, ""Hey, show me how you put together a document."" Maybe it's a lot of them going to Google and searching for boiler-plate stuff and adding it in. Maybe it's similar to how engineers write code, a lot of it's ""Watch them go to Stack Overflow and copy and paste,"" and now Codex or whatever's framed all that up. Again, I don't know because I haven't spent the time in there, but I think there's probably a ton of opportunity that I would never even think of. That if you just spend time with a group of people, you'd see a lot of opportunity to do a lot of things, because you'd be like, ""Oh my gosh, this model from two years ago does that out of the box. We'll just spin that up for you."" That's more of what I'm talking, than thinking, ""Oh, marketing copy for this niche there."" Saad: One quick thought here — and it's just a funny point — I try to read all of Jasper's churn notes. Like, who's leaving Jasper and why. There's a funny demographic of students who use Jasper. Maybe they're using it to do their homework. I'm not sure, exactly. I think some of them are. That combined with another insight of...the education sector is one of the hardest sectors. It's so hard to innovate in education, especially technology-wise. This is less that I think is a good idea, and it's more of just sending good vibes to whoever tries to use generative AI in education. Students seem to be using generative AI to do different things for their assignments and homework. I think it'd be really interesting if somebody did a TaeKwonDo move, and took some of the capabilities of generative AI that would make a student want to use it...maybe for something that's like cheating, like they're doing their English essay on a generative AI app, but actually combining that with pedagogy so they're actually still learning through that. Super hard to do these sort of startups, but wouldn't it be cool if you could have procedurally generated lessons for students? Just using this and making it really fun. It's a hard space. I'm not saying it's going to be the next unicorn or anything like that, but if somebody can succeed at using this in education...it's just impressive, if somebody pulled that off. Lukas: Well, final question, we usually ask what's the hard part about making machine learning models work in the real world. But maybe for you, I'm just curious more broadly; making this company, making this product that people appreciate so much, what's something unexpectedly challenging about making it work? Saad: I think it just goes back to this insight that people are more complicated than AI. 
What their UX preferences are, how they want to use it, how to simplify it for them...solving the black box of AI is maybe tens of thousands of permutations. But solving the right design for the human, there's infinite amounts of permutations for that. This is new for everybody, and we are still learning how people use this. Maybe it's obvious to Dave and everybody else, but I was really surprised to realize that customers want a lot of text, then they want to delete it, and find the gem inside of that boulder. I just don't write like that, but it was fascinating to see that's how our customers do. There's probably a thousand other things we learn about people and how they use generative AI. It's infinite, what we can learn about people and how they want to use this. It's hard, but inspiring at the same time. Dave: I was thinking, something that's hard...this is a step off of the actual tech, but it's the community aspect of it. Arguably, our community has been one of our biggest advantages. We've got customers with tattoos, and we're just all riffing in there, and it's just a lot of fun. But it's like, ""Man, it's exhausting."" I've had the whole community turn on me. All of a sudden Instagram's blowing up, people are pissed, people canceling and leaving. It's like, I'm the bad guy, I've got to go in there and save it. It's emotionally challenging to be connected to people in such a meaningful, or powerful, or exciting kind of way. It's work. There's times that I really have to get hyped up to like, ""Oh, I'm going to go and do this."" But to me, it just feels like table stakes. If you're not willing to do that, it's okay. Maybe find a customer base that you would be willing to do that with. Because it's such a valuable part of the game of just building a company and building great products. If you find yourself avoiding or not wanting to spend time with your customers, then it's just going to be so hard to do the rest of it there. Communities are a lot of fun and high stakes. So fun when they're going really well, so miserable when they're not going well. But it's really, really worth it. The benefits are immense. Lukas: You know what's funny, Dave? The stock entrepreneurship advice that I always give people — and I haven't heard anyone else say it — it's exactly what you said. Which is, find a customer that you like because you have to spend so much time with them, and you'll just be so much happier. It's funny- Dave: Totally. Lukas: -I think both of our companies, we've had the identical approach of just approaching a really specific customer type with a clean sheet of paper of like, ""Hey, what do you need?"" It's interesting to learn that from this interview. And I mean, congrats on such a vibrant community and such a well-loved product, that's so cool. Do you want to brag about any stats? I mean, everyone says Jasper's one of the fastest growing companies of all time. Is there any numbers that you make public that you'd want to tell the world? Dave: I think what got public was year one. By the end of the year we got to $35 million ARR. And we only had nine people, I think, at the end of the year that were doing that. Lukas: Pretty good. Don't tell my investors about that, please. Dave: Yeah, totally. No, it was pretty good. It was a mix of luck, and a mix of right time, right place, and the right team. That's one of those early eye-popping things. I think two months into it, we added a little over $3 million of ARR in a three-day period. Lukas: Oh my god. 
Dave: We launched this new product. I was just hyperventilating. Like ""Oh my god, I cannot believe this. I spent my whole life failing."" And you finally hit one. No, I mean, I think I am just as shocked and grateful to be a part of this as the next person. It's been a wild ride, and we're trying to just stay humble and stay focused. For the first year, we didn't hire anybody. We didn't have any meetings, we didn't do any investor calls. It was just so simple, almost this Garden of Eden of startups. It was just us hanging out with our customers all the time, and trying to build some stuff that they wanted, and that really worked. As we scale, we're trying to just keep that ethos and infuse that throughout the rest of the company, because there's something nice about simplicity — and something essential about simplicity — as you scale. Lukas: Absolutely. Well, thanks so much. Super fun interview. I really appreciate it. Dave: Yeah, you bet, man. Appreciate you having us on, this has been so fun. Saad: Thanks for having us. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So, check it out.",12226 +Shreya Shankar — Operationalizing Machine Learning,https://www.youtube.com/watch?v=LdMydLBDgEQ,3278,2023-03-02,"Shreya: We found that the velocity really matters a lot. The ability to validate as early as possible matters a lot. You don’t want to push a bad model to production. You don’t want to wait until your final stage of A/B testing, in order to find out that something is not going to work well. So we found that the earlier you can validate, the better it is. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Shreya Shankar was an ML researcher at Google Brain, and an ML engineer at Viaduct.ai, and now she's a grad student at UC Berkeley where she wrote a paper that we all love here at Weights & Biases called ""Operationalizing Machine Learning: An Interview Study"". This is a really fun interview. I hope you enjoy it. All right. Thanks so much for doing this. I really appreciate it. Shreya: Of course. Thanks for having me. I'm excited. Lukas: I think we've all been watching you on Twitter — or following you on Twitter — for a long time, so it's exciting to meet you. Shreya: You know, it's funny how every now and then you run into a Twitter mutual or whatever and it's like, ""Oh, I know you, but I don't really know you. But I know you."" Lukas: Totally. We were actually doing a podcast with Sarah from Amplify Partners and we both started talking about how much we liked your paper and I was thinking, really, I should just go directly to the source. Shreya: That's so funny. Lukas: But yeah, maybe if you could tell us...before we get into your paper — which I really want to talk to in-depth — maybe if you could tell us a little bit about your career and how you got excited about operationalizing machine learning. Shreya: Yeah. It's such a buzzword and, honestly, not the most exciting thing in the world. So it's kind of weird to think how I got here. I started out kind of doing deep learning research, ML research in adversarial examples, because that was the hot stuff in 2016, 2017. I had this moment of crisis when I graduated college. 
Should I do a PhD, or should I go and be an engineer, or go into industry? I decided, okay, I might as well go into industry, because I'm trying to write my statement of purpose on working on robustness in machine learning systems and ML system deployment, but I didn't really know what any of those words meant, because I had no experience. So I went to a company — I went to a startup that was doing applied machine learning — and I was there for a couple of years. That kind of changed the course of what I believe to be some of the most pressing problems in operationalizing, I guess, these ML systems. It's a very broad topic. I define it as anything that requires having a machine learning system that's serving some output that people use on a regular basis, that you don't want to shut down. I think that's just a completely different ball game than just the ML research that I worked with. And there's a lot of problems in there, both technically and organizationally, like the processes people use to...like, on-call processes. Things that people do to ensure reliability of their systems. And then of course the tools, the principles, and techniques. I found myself really going back to the databases and data management world in terms of like, ""How do I create these systems so that a bunch of data scientists can train models?"" And that really led me to, I think, doing a PhD in databases, where a lot of these problems — a lot of these MLOps problems — can be recast as traditional data management problems. Lukas: Interesting. Going back to when you got your first job...we hear about this a lot, but what were the biggest surprises about machine learning in practice versus studying machine learning in school? Shreya: I think I had a nice trajectory of surprises, because I started out kind of as the first ML engineer, and then we grew like 8x, 9x in my time at the company. We hired more ML engineers and more data scientists. My first surprise was that training the model itself — that whole experimentation to first model — is a process that you don't want to replicate with as much human labor as you do in the initial experimental stage. When you deploy it — that retraining, that kind of component — you kind of zoom out on your entire pipeline. You want to automate that so your human attention doesn't go there. It's, ""How do you glue together that model in relation to all of the other stuff you have?"" That was one nice realization that I had. And then after that, I stopped spending so much energy and time on modeling for the sake of modeling. Another realization that I had was, when you have multiple data scientists working on the same kind of model — or prediction task, or pipeline, or whatever you want to call it — all of a sudden, you need some sort of processes to make sure everyone is on the same page. If I try some sort of experiment, or I have this domain expertise around, ""Hey, the set of features probably won't work well. Or, this source of data is corrupted, so don't try to make features from that,"" how do I share this knowledge in a way that's not a stream of thoughts on Slack? And how do I keep this up-to-date as people come and leave the team? That happened a lot at my previous company before grad school. There's a lot of these small, small problems that kind of built up as we grew organizationally. As well as grew in terms of the number of customers we were serving, the number of ML applications we were delivering, or predictions we were kind of serving to different people. 
Yeah, there's so much. I don't know how to give you a succinct answer to that. Lukas: It's funny because you've also had the opposite experience here, right? Everybody talks about the shock of going from academia into industry, but you actually went from industry back into academia. Were there any surprises or misconceptions that you saw going back into the academic world? Shreya: I think it's different for me because I also switched fields. Databases have been...okay, machine learning has also been around for a long time, but the kind of venues that have been popular in databases have been around for a while. The norms are a little bit more well-defined and they're not changing as rapidly as the ML research norms. The community isn't growing at the scale that ML research is growing. In that sense, I felt like I was kind of walking into a completely different territory. I think...what I really like about the database community is they're very open and accepting of new ideas and new paradigm shifts. And I think it's because they've seen it multiple times before. They've seen it, like SQL to unstructured data, or structured to unstructured data. They've seen it from transactional systems to OLAP systems. They've seen web scale, all sorts of stuff. Like the MapReduce era. Maybe that's still going, I don't really know. In that sense, I think they're very eager and receptive to work in this kind of ML systems or data management for ML space. Which I felt that, at traditional ML venues, it was almost like you need to train models in order to have your papers accepted. If you weren't training models or doing model inference, then is this a research paper? I don't know. I think databases are just a better home for the work that I'm doing. Not to diss on all the model training work, of course. Lukas: Why is that? Because you care about practical, real-world applications. Is that a good summary? Or, why does database... Shreya: Yeah, there's a lot of problems around operationalizing models that are data management problems. When you do research in that, what venues are going to accept that research? I'm not necessarily training models in this research. So it's less likely, I think, for ML venues to accept this work. I'm also borrowing a lot of ideas from databases around how I think of models, how I think of provenance, how this can be used to solve a lot of observability problems. Things like that. Lukas: Well, so then you wrote this really fantastic paper, which we'll definitely link to. I was almost thinking maybe we should make it required reading before listening to this podcast, where you get into details. You wrote this paper on operationalizing machine learning. And you went out and interviewed a whole bunch of practitioners and summarized the field. Which is something that I always try to do, I almost feel like this podcast could be called ""Operationalizing Machine Learning"". I thought you really put things in a really well-structured, really interesting way, with surprising results that showed that you were really getting deep with the people you're interviewing. But I guess before we get into it, could you maybe summarize the key findings of the paper — for the folks that haven't read it — and then we can dive into the nitty gritty? Shreya: Sure. So, we interviewed around 20 practitioners. The criteria was that they have worked on — or are working on — a model that's being used in production.
Basically, it's serving some predictions or some output that customers are using, and somebody will get an alert if the system breaks. That's kind of our definition of production. We interviewed people across company sizes and across applications like self-driving cars, banking, whatsoever. We found...we looked for common patterns across people's interviews. We found four high-level stages of their workflow around experimentation, like the evaluation and deployment, monitoring and response. And then data collection, which wasn't often performed by the ML engineers that we interviewed, but it was a critical part of the production ML pipeline. We identified these four components — or these four stages — and then we also identified, ""What are the variables that govern how successful their deployments will be?"" Like, ""What are the things to think about whenever evaluating tools to use in each of these stages? How do I know if I'm on the right track to a successful deployment?"" And we found that the velocity really matters a lot. The ability to validate as early as possible matters a lot. You don't want to push a bad model to production. You don't want to wait until your final stage of A/B testing, in order to find out that something is not going to work well. So we found that the earlier you can validate, the better it is. And then finally, the last ""V"" is versioning. Which is, ""How do you manage all of the different versions of models that you're going to have as time goes to infinity? How do you think about all the edge cases or corner cases that your system must respond to?"" Maybe that's slapping on a different version, ""If you come from this customer, or you come from this population, we'll give you this version."" Managing that is a pain point. So that's the high-level findings. Lukas: You obviously have a fair amount of experience in this already, having done this job yourself. And you're pretty active on Twitter, and have been in the conversation around this for quite a while. Were there parts of what you heard that surprised you? Shreya: Definitely. And selfishly, I think, I conducted this study so thoroughly and as a research thing that took 1.5 years that we kind of did on the side. I did this because I was so afraid that I was leaving industry and going into academia, and I'm going to go into a bubble, and try to build systems for people and not know what I'm doing, or whether this is useful. What I did not expect was to kind of change my research agenda and direction. One concrete example of this is distribution shift. I used to believe...well, this maybe is a problem depending on how you define it. But the idea that if you have this static model in production — a static set of parameters or a function that's being called on some features, and these features are changing, time is going on — at some point, you're...this is like the classic ""view staleness"" problem in databases, right? You need to refresh your views to keep up to date with the underlying data. If you think of a model as a view on a table, the same thing exists. But I think a lot of the ML literature — or even things that I'd been thinking about are, ""How do I make my views robust to these changes in the underlying distribution?"" In practice, ""Sure, that's great,"" but if something as simple as recomputing the view or retraining the model solves my problem of staleness, then why don't I just do that? 
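For readers who want the analogy in code: a minimal, toy sketch of treating the model like a materialized view that gets recomputed (retrained) once it is older than some staleness limit. The class, helper names, and threshold are made up for illustration, not any particular system Shreya describes.

```python
import time
import random

STALENESS_LIMIT_S = 60 * 60  # hypothetical: treat the "view" (model) as stale after one hour

class MeanModel:
    """Stand-in 'model': just remembers the mean of the data it was fit on."""
    def __init__(self, values):
        self.mean = sum(values) / len(values)
        self.trained_at = time.time()

    def predict(self):
        return self.mean

def load_recent_values():
    # Stand-in for querying the freshest feature/label data.
    return [random.gauss(0.0, 1.0) for _ in range(100)]

def get_model(current=None):
    """Refresh the 'view' (retrain) whenever it is older than the staleness limit."""
    if current is None or (time.time() - current.trained_at) > STALENESS_LIMIT_S:
        return MeanModel(load_recent_values())  # recompute the view
    return current                              # still fresh enough, keep serving it

model = get_model()
model = get_model(model)  # a no-op until the model ages past STALENESS_LIMIT_S
print(model.predict())
```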
You'll find that at these very large organizations like Meta, Google, Amazon, et cetera, that they're simply retraining their models like six, seven, eight times a day or even every hour. And distribution shift is not their problem. In this setting — when you're retraining all the time — retraining on corrupted data becomes a problem. ""How do I make sure that my data is clean and uncorrupted? How do I identify when to block a retrained model from being promoted back to production?"" All these sorts of problems...it's like, oh, these are very interesting research problems, but this is not what I thought of distribution shift to be. Hopefully that answers your question. I can think of others. Lukas: Totally. The people that you were talking to, were they like individual contributors building models or more like managers? We often see like a separate MLOps team that's sort of doing the infrastructure, while other people are kind of doing the training. Who were the kind of folks that ended up in your study? Shreya: We required that everyone was an ML engineer — responsible for a model, or pinged when the model predictions are bad or someone's complaining — at some point in their career. Some of them had switched to the infrastructure-building side. Some of them had become manage...I think two of them had become managers. That's written in the paper. Everyone there acutely knew what it was like to have put a model in production and somebody complained about the predictions. That's really what we wanted to drill into. Like ""All right. What did you do to fix this? What does your team do to fix this?"" Simply retraining the model often fixes it, like 80% of the time. These ML engineers have so much on their backlog. If they can kick off a retrain, and get to something else on the backlog, and it works 80% of the time, that is going to be the solution. That is the best solution. I feel like more ML researchers should know this. But it could be...maybe I'm biased. Lukas: Were these people mostly doing unstructured data? One of the big dichotomies that we see is structured versus unstructured. Where unstructured, you often get more neural net techniques, you get bigger models. You get almost a totally different stack in many cases. Did you observe that too? Shreya: Definitely, we talked to some people who had very image-heavy...self-driving cars or autonomous vehicles was a good example of this. Lukas: For sure. Shreya: They're definitely using neural networks. I think when drilling, though, into data quality and this kind of data management, I think people tend to think about relational data management. ""How do I manage the embeddings?"" or ""How do I manage tuples?"" I don't know if we've gotten to the place where we're thinking about traditional data cleaning and data quality, in terms of images or other unstructured data. We didn't focus the interviews too much on that. Lukas: Were people doing a lot of exploration of features? Did feature stores show up much in your interviews? Shreya: We explicitly went into this not wanting to hit the buzzwords, just to see what buzzwords would come out. Lukas: Totally. Shreya: Feature stores were almost never mentioned. Lukas: Interesting. Why do you think that is? Shreya: I think people thought about the idea of a feature table or a feature service, but very few people said the term ""feature store"". What mattered to them was just having features that were available for them to query. 
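Shreya mentions above needing to know when to block a retrained model from being promoted back to production. One hedged sketch of such a gate, with placeholder thresholds and toy "models" (plain functions), might look like this; a real team would tie the numbers to an SLA.

```python
def accuracy(model, dataset):
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)

def should_promote(candidate, incumbent, reference_set, min_accuracy=0.80, max_drop=0.01):
    """Gate a freshly retrained model before it replaces the one in production.

    `reference_set` is a held-out evaluation set that is NOT drawn from the same
    (possibly corrupted) window the candidate was just trained on.
    """
    cand_acc = accuracy(candidate, reference_set)
    prod_acc = accuracy(incumbent, reference_set)
    if cand_acc < min_accuracy:
        return False, f'candidate below accuracy floor ({cand_acc:.3f})'
    if cand_acc < prod_acc - max_drop:
        return False, f'candidate regresses vs production ({cand_acc:.3f} < {prod_acc:.3f})'
    return True, 'ok'

# Toy usage: the dataset is (x, label) pairs.
data = [(i, i % 2) for i in range(100)]
incumbent = lambda x: x % 2          # perfect on this toy task
candidate = lambda x: (x + 1) % 2    # a bad retrain, e.g. fit on a corrupted window
print(should_promote(candidate, incumbent, data))  # (False, 'candidate below accuracy floor (0.000)')
```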
Oftentimes at organizations, it just happens to be a cron job that's populating features in...obviously, not at Meta, they're not going to have a Postgres table of features. But in a lot of mid-size cases, mid-size companies, where you can have Postgres table with features, and you can have a cron job that recomputes features every day, and it's fine. I think it ends up going back to this view staleness problem. How stale does it need to get for your business to experience some performance hit? I don't know if you need to be computing them on-demand all the time. Lukas: I love your simple categorization of data collection, experimentation, evaluation, and monitoring, and response. Did it feel like...of those categories, I think you said data collection was usually a different team. Where were your respondents spending most of their time, and where do you think they felt the most pain was? Shreya: Because all of the pipelines that they talked about were deployed or already in production, people did not focus on experimentation as much. I imagine that this is not representative of the ML community at large. I think there's a lot of people who are still working on getting their first production pipeline out there. Lukas: Just to be clear, none of these questions are leading. We're not... Shreya: Yeah, I just want to say this is definitely a biased subset towards production pipelines. I think the evaluation and deployment...actually, I think monitoring and response... It's hard. 50/50 on those, just based on the annotations of the interviews that we did. Or the codes and how we grouped them. It was very 50/50 on those two. And they often link into each other. People will talk about problems with monitoring stage deployments. Does that fit in monitoring or stage deployments? I don't really know, but I think it's definitely a big pain point. Evaluation and beyond. Lukas: One of the key findings here is that monitoring for data corruption or catastrophic errors is more important than monitoring for data drift. Shreya: Totally. Lukas: But you'd sort of imagine that monitoring for data corruption would actually be a lot simpler. What makes that so challenging to do in production? Shreya: I'm writing a paper on this, based on some work with Meta. In the limit, people may add features to a model. They don't remove features. What happens with these models ends up getting hundreds of features, to thousands of features, to ten thousand features. That's one thing. You've got models in production with tens of thousands of features. Another thing is that people are coming and going in these organizations. The ML engineer who built the model does not exist at the company anymore, and the model is still in production. Couple that with existing data monitoring or data cleaning solutions — which is, defining schemas for all of my features. Like bounds, acceptable values, types for each one of them — great. Who is going to do that and maintain that as these feature tables or as these pipelines evolve? I don't know. The other thing is, because you have so many features, the probability that at least one record and one column is corrupted is so high. And then you get this problem we talked about in this paper, of just straight alert fatigue. It's so painful. At the end of the day, it doesn't matter if just a couple of records are corrupted in one column. The problem is, again, when does it get so bad that it brings down the business? And how do I find that pretty precisely? Lukas: It's funny. 
I'm nodding — I've lived this myself, many, many times. That's why I totally agree — but I'm actually thinking if I hadn't lived it, it might not be obvious how this happens. Is there a concrete story you could tell, about how a feature gets corrupted in production and the havoc that it causes? Shreya: Yeah. Okay. I want to give...I feel like this is a question where people will attack me for any answer. If I give an example of a Meta or a Google, they'll be like, ""Oh, but not every company is a large company."" Lukas: I think the story just illustrates the chain of events. Of course we're all so smart that we never have bugs, but... Shreya: Sure. I think I'll give one example at my previous company, which I lived. Features were generated from different sources. When I say different sources, it's not just weather data or whatever, it's like, different clients have different data. And then also, we have different data pipelines that are repeatedly pulling from Snowflake or repeatedly generating features. Oftentimes these pipelines will fail, because maybe there isn't enough resources or there weren't spot resources available in us-west. I don't know. Things will happen and these things will all be null. Will this corrupt model perform significantly, to where I actually see a regression? I don't know. But this happens a lot. It also happens that some of my clients, they send me data every day. One day, they send it in a CSV or Parquet. One day, they switch the order of the columns. A totally reasonable thing, but again, impacts a subset of tuples, a subset of columns. I could name a bunch of these. I think this is pretty generalizable to most orgs. Lukas: Totally. I remember at my first job, all the features — for some reason, there's no technical reason this is necessary — but someone started just naming each feature column with four letters — literally four letters — for no reason. We just kept doing it, so all the feature names were just a really compressed word. Shreya: Yeah, stuff like that- Lukas: At some point, no one knew what they meant. It was nuts. Shreya: Stuff like that really just makes it so hard to go back and trace these bugs. Why did my model regress 2%, for example? Oh my God. There can be a laundry list of reasons why. Just to even go there and try to investigate why would be a nightmare. Lukas: I guess, though, if the people you talk to are just constantly retraining over and over, that actually might be one way to avoid data corruption. Provided the way that they collect the features is the same as how the features get loaded into the model in production. Shreya: Yes and no. Suppose I retrain every hour, and when I retrain, I just fine-tune my model on the last hour's data. I split that into training, retraining, and then I split a little bit of that into validation. My criteria for passing is it has achieved reasonable performance on that small validation set. Suppose that whole hour of data is corrupted. It might just be the case that, ""Great, on this corrupted data...because I trained on it, I performed well on the validation set. Amazing. Same distribution."" Put it back in production, and then somebody fixes the bug, and all of a sudden the performance changes. Because that snapshot of data was different from previous snapshots, or future snapshots. Lukas: Right, right. Shreya: I do think that there is this still data corruption problem. The challenge is in identifying the corruptions at the timescale that engineers react and respond to bugs. 
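A toy sketch of checks for the two corruption modes Shreya just described (a null burst from a failed pipeline, and a client renaming or reordering columns), with a tolerance so that a handful of bad records doesn't page anyone. The schema and threshold here are hypothetical.

```python
import numpy as np
import pandas as pd

EXPECTED_COLUMNS = ['user_id', 'event_ts', 'amount', 'region']  # hypothetical schema
MAX_NULL_FRACTION = 0.05  # tolerate a few bad records; alert only past this, to limit alert fatigue

def check_client_upload(df: pd.DataFrame) -> list:
    """Return a list of problems worth alerting on for one client's daily file."""
    problems = []
    # 1. Columns renamed or reordered upstream (the CSV column-swap failure mode).
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f'unexpected column layout: {list(df.columns)}')
    # 2. Bursts of nulls from upstream pipeline failures.
    for col in df.columns:
        null_frac = df[col].isna().mean()
        if null_frac > MAX_NULL_FRACTION:
            problems.append(f'{col}: {null_frac:.1%} null values')
    return problems

# Toy usage
good = pd.DataFrame({
    'user_id': range(100),
    'event_ts': np.arange(100.0),
    'amount': np.ones(100),
    'region': ['us'] * 100,
})
bad = good.rename(columns={'amount': 'amt'})   # upstream renamed a column
bad.loc[:30, 'event_ts'] = np.nan              # and a pipeline failure left a burst of nulls
print(check_client_upload(good))   # []
print(check_client_upload(bad))    # two problems reported
```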
So you don't put models in production that won't do well on future data. Lukas: I guess one of the little gems in your paper is the controversy...I forget exactly how you put it. You said Jupyter Notebooks are quite controversial, or quite a ""bimodal distribution"" of responses on that. I'm kind of curious your take on Jupyter Notebooks. Shreya: I think my take is a little bit biased. I'm not old enough to have lived the history of data management tools, like spreadsheets and whatnot, when they came out. But from my reading of old work, it seems that these quick and dirty prototyping data tools were used to tell stories, and have primarily been used to tell stories, regardless of whether it was done correctly or not. I think that this is the case for a lot of data tools. Jupyter Notebooks are not really an exception. While...if I want to start a company around my opinionated...""I don't want errors and I want...No one's allowed to use Jupyter Notebooks,"" I think that's just an opinion. I feel like it's completely useless to go and try to prescribe a philosophy to a industry that has a pattern of using these data measurement tools. Lukas: That's kind of interesting, because I actually feel like you're putting yourself in a place where lots of people might come to you and be like, ""Shreya, what should I..."" People come to me all the time, and I think you're more qualified to say, ""How should I set things up? Should I be letting my team use Jupyter Notebooks?"" And I guess if someone asked you...am I hearing you right that your answer would be, ""No, don't use Jupyter Notebooks""? Shreya: Oh, gosh. I think it really depends on the application, or what company I'm trying to run, or what team I'm trying to do. What are the engineering predispositions of the people on the team? Lukas: Man, you're turning into such an academic. I love it. Shreya: I'm sorry. I can't give... But I think that's the point. One thing about this paper is that it's an academic paper, so we can't write all of our opinions in there. But I really wanted to drive home the point where it's just like, the reason that we think that people have these conflicting opinions is because they have conflicting priorities. Do they want initial velocity to be a higher priority than validation? That's personal. Or, that's organizational. I think those values are different for everyone. Lukas: That's fair. I would think that, over time, your priorities would naturally shift. Especially, I guess, as a startup founder. In the beginning, you don't know how useful the model's going to be. You don't know if it's really going to see the light of day, even. And then over time, you really want to start to nail things down. You worry more about the downside risk. Shreya: Then you have to account for this infrastructure. Or transition in your organization, from Notebooks to whatever, if you want to deprecate them. We interviewed one engineer who...they had this whole...their quarterly goals were to get evaluation of models out of Notebooks and put them in this standard system. Didn't matter what CI/CD tool or whatever, but the whole point was just get this in a standardized system, so that people would stop running Notebooks as a way to show that...and everyone had a different fork of the notebook. 
I don't know, I feel like stories that just make me like, ""Oh my God, no one is working on ML."" No one is working on their direct company objective, because they're fighting their infrastructure battles and dealing with all the tech debt that they introduced from having the Notebooks. What is the trade off? I don't know. Lukas: Were there other stories like that, or themes like that, where there's a consistent regret of something that people did in the beginning, that they now can't get rid of when things are in production? Shreya: The people who we interviewed that were more senior in their roles — or had been around for longer — just accepted this. It's like, ""Oh yeah, organizational turnover is a thing. Tech debt is a thing. Our goal is not to remove it completely, but how do we keep shipping new things, keep old things up and running in the face of all of the tech debt?"" I think that's a more interesting question to me. There are a lot of one-off stories. I can't think of any off the top of my head that were specific to Jupyter Notebooks. I guess there was one other anecdote, where somebody spent 3 to 6 months trying to reproduce some Jupyter Notebooks, just to make a point that they shouldn't use Jupyter Notebooks within their organization. And then their organization had this push to — this was more of a smaller company — to get rid of Notebooks, or Notebook usage. Again, it's just so polarizing. Lukas: That's a little funny. My honest experience with Jupyter Notebooks is I think they're kind of delightful, but I didn't...I predated Jupyter Notebooks, so I was doing most of my hands-on research before they existed, so I'm just a little more comfortable in the command line. I always feel like a little ashamed that I'm not sticking with the new trends, but it sounds like there may be a backlash coming to these Notebooks. Shreya: I think it's also different. Different people are different. I'm the type of person where somebody hands me a Jupyter Notebook or something, ""Here are some results,"" and I will be like, ""Show me how the results got here."" Because I'll be paranoid at every step of the way. We talked about this in the paper. This paranoia, this sense of paranoia we all get. The same thing is...at least, the same thing is true for me when it comes to SQL queries. If you give me a SQL query, I want to know everything that's in your...I want to re-execute that SQL query so that I get the same result. Same thing with spreadsheets. Give me the spreadsheets, don't take the screenshot of the spreadsheet and save it to me. That's totally personal. I think people are different in their philosophies of how they do this. Which probably affects stuff. Lukas: Interesting. What is your takeaway on the whole space of ML tooling? Obviously, I run an MLOps tools company, but please, you won't offend me with your answer here. I'm really curious. Did you feel like people were using tools or were they rolling their own tools? Did you feel like they should...there's gaps and missing tools? Were you inspired to start a company in the space from the feedback that you got? I think it would be hard for me to contain myself, but I'm curious what your raw take is. Shreya: There's a lot of companies that could be started from that paper, but anyways... I thought that the three Vs thing made tools — or at least the viability of ML tools — make a lot more sense. Like experiment tracking. Weights & Biases is a great example. It really 10Xs the velocity experience within experimentation, truly 10Xs it. 
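What Shreya is describing, replacing the copy-paste-into-a-spreadsheet loop, looks roughly like this with the standard wandb Python client (assuming you're logged in; the project name, config, and metrics here are invented for illustration).

```python
import random
import wandb

# Hypothetical project/config; the point is just that metrics land in one place
# instead of being copy-pasted into a spreadsheet after every run.
run = wandb.init(project='demo-experiments', config={'lr': 1e-3, 'epochs': 5})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.01  # stand-in for a real training loop
    val_acc = 0.7 + 0.05 * epoch
    wandb.log({'epoch': epoch, 'train_loss': train_loss, 'val_acc': val_acc})

run.finish()
```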
No longer do I have to go copy-paste my results into a spreadsheet and back and forth between training script and spreadsheet. It's just nice. Great velocity experience. I think most tools that I've seen in this space don't really 10X in any of these dimensions. Lukas: What are the dimensions, for someone who hasn't read the paper? Shreya: Oh, velocity, validating early, and then versioning. Versioning is an interesting...I think there's a lot of people trying to work on reproducibility and related — I have thoughts on reproducibility — problems, but it really needs to be a 10X experience in comparison to what people used to do with versioning, or with one of the variables. I think that that's really hard to do in the ML tooling space. People are really trying to find that. People are, at least in my experience, simply trying to throw software engineering principles at ML workflows, and hope they land. If it doesn't really push one of these variables, then it's unclear that, to me, it's a successful tool. ML monitoring is also a really interesting space, because people do care about the concept of validating. I want to validate that my predictions are good before somebody complaints. But it's really unsolved. How do we do this precisely? How do we not give people alert fatigue? I think people will go to a lot of extents to...the friction of integrating an observability or monitoring tool can be pretty high if you get results, but people are not getting results. Lukas: What are your thoughts on reproducibility? Shreya: There's an interesting paper... Gosh, I don't remember off the top of my head — this is bad, I should know — but they pose that, ""Hey, exact reproducibility is often just not achievable in a lot of ML settings."" Just because, when you're a data scientist at a company and you're launching the job, yeah, sure you can control your random seed, but you can't control the infrastructure or the GPU provisioned to you, the underlying data that you called from. What matters is getting some percentage-wise...if you're trying to reproduce a model, I want to get the same accuracy or a similar accuracy. I cannot rely on getting the same model parameters. This notion quite...it's a little bit orthogonal to all of the provenance and instrumentation of ML workflows to get exact reproducibility. I'm not sure how feasible that is in a Kubernetes environment, for example, or larger-scale infrastructure. Lukas: To summarize your position, is it that reproducibility is impossible? Shreya: I think for reproducibility tools, we need to rethink what it means to get reproducibility. Tracing, for example. Tracing a data scientist's workflow, saving the exact artifacts. Is that what matters? What is it that truly matters with reproducibility? If I have the artifact but I can't reproduce that artifact, or if I logged artifacts at every step of the way but I can't reproduce them, does that help me? I feel like these are questions that I don't have the answer to, off the top of my head. Lukas: I can't help but weigh in on my own thoughts here, because people ask me about this a lot. I totally agree that reproducibility is much less of a binary switch then people realize. There's lots of things you can do that are increasingly annoying, to make things more reproducible. And I think there's a real cost, so you shouldn't necessarily do everything possible, but I do think most people would stand to gain from going further along that reproducibility tradeoff curve than they're doing today. 
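One hedged example of "markers on that axis": cheap context you can snapshot per run (seed, code commit, data hash, environment) even when bitwise reproduction isn't feasible. The helper below is illustrative, not any particular tool's API.

```python
import hashlib
import json
import platform
import random
import subprocess

def snapshot_run_context(data_path: str, seed: int = 0) -> dict:
    """Record cheap reproducibility markers: seed, code version, data hash, environment."""
    random.seed(seed)  # also seed numpy/torch here if they are in use
    with open(data_path, 'rb') as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    try:
        commit = subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode().strip()
    except Exception:
        commit = 'unknown'  # not in a git repo, or git unavailable
    return {
        'seed': seed,
        'git_commit': commit,
        'data_sha256': data_hash,
        'python': platform.python_version(),
        'hostname': platform.node(),
    }

# Toy usage: write this next to the model artifact so a later reader can at least
# see what went into the run, even if exact reproduction is impossible.
with open('train.csv', 'w') as f:
    f.write('x,y\n1,2\n')
print(json.dumps(snapshot_run_context('train.csv', seed=42), indent=2))
```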
I always try to explain that to customers actually, of like, ""Hey, we try to make it easy to save a lot of stuff, so that you have..."" Because most people's starting place — at least in my experience — is zero reproducibility. I'm talking about a model they made six months ago, they couldn't even tell you the state of the code when the model trained. Forget about the data that went into it. I think every step towards reproducibility is going to make your team function better. It's going to help you with governance. There's so many reasons you actually would want to do it, but I think it is incredibly expensive to get perfect reproducibility, like you're saying. Shreya: Yeah, I like your definition of pushing further along the reproducibility axis. Then the question becomes, ""Okay, what are the markers on this axis?"" I don't have an answer to this. I'm curious. Lukas: Well, if that's your next paper, I'll definitely...you can come back and tell us about it. Shreya: Not my next paper. Lukas: You had another interesting...I love all the different frameworks that you're introducing here. I feel like you're really good at summarizing stuff and putting them into simple, CEO-style frameworks. You also had this notion of layers that tooling lives at. Could you maybe talk a little bit about that? Shreya: Yeah, this terminology caused a lot of controversy within the authors trying to come up with the names. I don't think anyone loves the four layer names that we have made. I don't know if they're the best. We think about the stack of tools that an ML engineer interacts with, in frequency of ""most frequency to least frequency"", top to bottom. At the top is this run layer. Then we think about a pipeline layer. Pipelines are made up of multiple components. Feature engineering, feature generation training, train/test/split, whatever components you want to have. And then at the bottom, underlying, is all this infrastructure that people use. Their compute, what is the workflow built on top of? We notice or observe that when people want to make changes to their workflow, it's easiest to do in the run layer. It is much harder to kick something out of the infrastructure layer and replace that. It happens in certain companies, but it's a big, organizational effort to do it. They all have many meetings. It's true, it's true. Lukas: What would be an example of a change someone might make at that layer? Shreya: Which one, the infrastructure layer? Lukas: The infrastructure layer, yeah. Shreya: I think one example is moving from GPU clusters on-prem to cloud GPU. That's one big one. Another one's introducing an orchestrator for these servers, like Kubernetes. At least, these are things that I personally experienced, also at my previous company that was at. And even the pipeline layer is super annoying to make changes to. If you have Airflow, Kubeflow as your orchestrator, it is such a pain to migrate that. Every planning meeting will be like, ""And should we migrate?"" ""Yeah, we know we need to migrate."" ""Oh, but it's going to be so much work."" ""Yeah, I know."" Stuff like that where it's...if I'm a tool builder, I really don't want to get into that space, because I'll have to sell so hard to people, to switch a tool at that layer. In that sense, one thing we found is the open source tools that could be integrated at the run layer...Weights & Biases is actually a great example of this. 
Where it's like, one engineer can simply integrate that into their pipeline, and it will be useful for multiple runs of that pipeline. They're not replacing the pipeline layer tool. We found that those were the tools that the interviewees were willing to adopt the most. But I feel like we could have done a lot more in running a survey, a quantitative survey or something, on this. Maybe that's somebody's future work, whoever's interested in that. I'm hesitant to prescribe this as the end all, be all. Lukas: Prescribe the layers, or prescribe specific- Shreya: Oh yeah, the layers. And the idea of, ""If you're trying to build an MLOps tool, don't try to replace TensorFlow or PyTorch."" Those people have their moat. People are not going to rewrite all of their deep learning models in your framework, maybe, unless you build- Lukas: What if it's JAX? Maybe they'll do it. Shreya: What if it's JAX? I don't want to get into that again. I don't want to get into these debates. But it feels like one layer is easier than the other. Lukas: I really appreciate your open-mindedness about workflows and setups and I totally share it, but here we are in October 2022, and somebody listening to this podcast, they probably...I think what a lot of the people listening to it are looking for, is actually some help in navigating. There's just so many options for tech stacks. SageMaker is not the same as Vertex, it's not the same as Azure ML. Shreya: Totally. Lukas: Where would you start? Let's take the example of a startup CEO, just getting product market fit, doesn't have a lot of resources, but ML is important to them. Where would you recommend him or her to begin looking at a stack? Or how would they think about that? Shreya: Great. I will say a stack that is tried and true...it may not be the best stack, but I would recommend getting an AWS account or something, having some EC2 cluster set up. If you're working with large amounts of data...I have my qualms about Spark, but Spark is useful and people know how to use Spark as a query engine on large amounts of data. I would also- Lukas: -okay, okay, what are your qualms about Spark in a nutshell? Then we can continue. Shreya: I'm sitting in the lab where Spark was invented. I don't like the idea. I think I subscribe to the database community's view that MapReduce as a philosophy of processing large amounts of data is not great, because it forces the application developer to reason about record-level problems. If one record is corrupted, how do I handle it? It also forces the application developer to reason about consistency, if I'm controlling the parallelization, if I'm programming a Spark job, or MapReduce job, or something. I shouldn't have to do that if I'm a data scientist. I shouldn't be controlling these kinds of variables. This is what DBMSs are really great for, they abstract away these parallelism internals. They abstract away ACID, how people achieve ACID. The DBMS is doing that. In that sense, I'm not a really big fan of this MapReduce style of work. It's also kind of inefficient to have a SQL layer on top of a MapReduce layer, on top of whatever the storage is. But that's a separate thing aside, that's not really a user experience thing. As an aside, I think it's interesting, the whole DuckDB explosion that we see going on, of bringing these kinds of large-scale data queries back to the DBMS. I'm curious where that's going to go. Lukas: Interesting. All right, sorry, I totally derailed you, but you're saying AWS, EC2, start there... Shreya: Go get an EMR cluster for Spark and have some way for data scientists to interface with these machines. Kubernetes...this might be a lot, but some way for them to access the machines that you have in your computing cluster. That's one thing. I think another big thing is, once you get to production-level stuff, where if you have data pipelines at your company, you need to have some sort of workflow scheduler. Something to define a DAG and then execute those DAGs. Airflow is really tried and true. I know a ton of people hate it, but it will work. It is also known to work in cloud-based settings, and they also have a nice Python DSL to be able to define DAGs. They have a nice UI to trigger the DAG. They have a nice UI to backfill the DAG. I would say that's good enough.
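For concreteness, the kind of DAG Shreya is describing looks roughly like this in Airflow's Python DSL (assuming Airflow 2.x; the task names, schedule, and empty task bodies are made up for illustration).

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# The task bodies are placeholders; the point is the shape of the DAG, not the logic.
def extract_features(): ...
def train_model(): ...
def evaluate_and_promote(): ...

with DAG(
    dag_id='daily_retrain',
    start_date=datetime(2022, 1, 1),
    schedule_interval='@daily',   # recompute features and retrain once a day
    catchup=False,
) as dag:
    features = PythonOperator(task_id='extract_features', python_callable=extract_features)
    train = PythonOperator(task_id='train_model', python_callable=train_model)
    promote = PythonOperator(task_id='evaluate_and_promote', python_callable=evaluate_and_promote)

    features >> train >> promote  # features, then training, then the promotion gate
```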
And then on top of that, I think people...hire a good data science lead, ML engineer to do the ML stuff. Lukas: Do you have any advice on hiring a first ML engineer or data scientist? Shreya: I wish. I almost want to say don't hire somebody right out of college, but I was hired right out of college, so I shouldn't say that. But the problem with that was I didn't know anything. Here I was writing deep learning models in Jupyter Notebooks, saving them to S3, and then telling other people to read them. No, that's a terrible workflow. A terrible workflow, if you want multiple people to collaborate on the same thing and share learnings effectively and also scale. I think people who have data engineering experience tend to at least think about the right...not just terminology, but concepts. SLAs are a good example of...ML people should also be thinking about SLAs. What is a minimum accuracy that my product should have before somebody complains about the predictions? Stuff like that. How do I schedule things on a recurring basis? A lot of ML pipelines can be cast as data pipelines, so I would hire somebody who has data engineering experience for sure. Ideally, hopefully someone who has trained a model, but I honestly think that's less relevant than having the data engineering experience. I feel like I've talked to so many people who have ML experience but then don't have any software engineering or data engineering experience, and that's really hard to convert when you're starting a company. Lukas: Makes sense. Well, we always end with two questions. I want to make sure we get them in. One is, what is an underappreciated topic in machine learning, broadly? Shreya: Machine learning... Lukas: If you were to go back into machine learning and could investigate something, where would you be wanting to look more? Shreya: That is not a data management problem. This is an exercise for myself. I think the idea of interpretability, but I think of it as provenance. The influence functions paper is interesting. If I have an image and I have a prediction of this image, what are the training examples that most likely helped the model get to this image? Techniques to do this are computationally expensive, or are limited to a single point in the training data, or require a full pass through the training set for every...I think there's a lot of work that can be done there on interpretability. Not of the model, but explaining how a prediction got to where it is, based on the data that a model was trained on. Lukas: Interesting. Do you remember the name of that paper? I'd love to put it in the show notes. Shreya: ""Understanding Black-box Predictions via Influence Functions"". Lukas: Nice.
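For reference, the central quantity in that paper (Koh & Liang), written in its notation as best I recall it: the influence of a training point z on the loss at a test point z_test, where the fitted parameters and the empirical Hessian of the training loss appear as shown below.

```latex
\mathcal{I}_{\mathrm{up,loss}}(z, z_{\mathrm{test}})
  = -\nabla_\theta L(z_{\mathrm{test}}, \hat\theta)^{\top}
     H_{\hat\theta}^{-1}\,
     \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla^2_\theta L(z_i, \hat\theta)
```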
And then I guess the final question — which you are incredibly qualified to answer, I think — is, when you look at the life cycle from research paper to running successfully in production, where do you see the biggest bottleneck, or the most surprising bottleneck? Shreya: Evaluating if this research — or this new idea — actually provides a worthwhile gain over another solution. Is there something off the shelf? Is there something baseline that can get something very similar, or get a performance that's very similar, but not have to go through this headache of understanding the new thing and implementing it? I don't even think we have frameworks to think about this. Lukas: You're saying it's painful to try all the different things and see if they actually work? Shreya: Yeah. Do we even need to try all the different things and see if they work? If I'm a practitioner at a company, when do I actually pay attention to some ML research development? When do I actually integrate it into my system? I don't think that people have a framework for thinking about this. At least I haven't heard of a big one. Lukas: All right, well thank you so much for your time. That was super fun. I really appreciate it. Shreya: Yeah, thanks for having me. Lukas: If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description where you can find links to all the papers that are mentioned, supplemental material and a transcription that we work really hard to produce. So, check it out.",8004 +Jonathan Frankle: Neural Network Pruning and Training,https://www.youtube.com/watch?v=y7k50Qux9Hc,3955,2023-04-10,Jonathan: I find personally a lot of impact in being downstream with these problems. If I'm going to make messes, I have to clean them up. So in some sense, my policy work is an attempt to make sure that, as I'm on the bleeding edge of creating this technology, I'm also providing that same insight to policymakers so they can adapt to it as quickly as possible. Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world. I'm your host, Lukas Biewald. Jonathan Frankle is Chief Scientist at MosaicML and a soon-to-be Harvard professor. He wrote a well-known paper on the lottery ticket hypothesis, about how neural networks learn and how you can prune them. He's also taught policy at Georgetown University Law Center. This is a super interesting conversation, and I hope you enjoy it as much as I did. All right, why don't we start by hearing about your journey to what you're doing now? I think you've had kind of an interesting background and career, so it's probably great to start there. Jonathan: Yeah, it's been a winding road. If you go and look at my CV, you'll be a little bit confused about some of the things that have happened and how they came to be. But the high level is, I'm a computer scientist, trained from the beginning, from undergraduate all the way to the present. I'm actually defending my dissertation this Friday, so I can't quite say I have three degrees yet, but very, very close to it. But it's a bit of an odd trajectory. As an undergraduate, I did some research on programming language theory, which is what I got my master's degree in. Then I went and spent a year teaching at a law school and doing technology policy work in DC. Then I came back to MIT and wrote a paper on cryptography, and somehow stumbled my way into machine learning somewhere along the way, having never taken a class on the topic prior to grad school.
And I think, from that journey, there are really two big takeaways if you want to understand me better and get to know how I think about the world. When you start connecting the dots on the topics that I've worked on and what I've been good at, they're all the messy, hard-to-measure problems. I don't like to work on clean things where you get a nice proof and call it a day. I love working on the messy things that intrinsically don't have answers: security and privacy, law and policy, and now deep learning, where there's no nice, neat proof that's going to wrap everything up and tell us all the answers. It's going to be intrinsically messy. We're dealing with complex problems and real-world data, and what we do at Mosaic is try to exploit that messiness and find a path through it, in order to deliver more efficiency to people. So if you really want to connect the dots, that's really how you put the pieces together, as far as I understand it. Lukas: Awesome. Well, I'd love to start with...this is probably bad from a podcast marketing perspective, but I want to start with the thing that I'm most interested to ask you about, which is, you did a really well-known paper on pruning neural networks, the lottery ticket hypothesis, I believe, or what was the title of it? Jonathan: That's the one. That's the one with notoriety at this point. Lukas: I guess before we get into it, could you describe the sort of thesis or the key results of the paper, and then I have a few questions for you? Jonathan: Yeah. Speaking of describing a thesis, in my other tab right now I've got Overleaf open with that thesis. But the really simple statement (not the 200-page version, because I'm sure nobody wants to hear that; if you want it, it'll be on arXiv) is about the neural networks we train. Ask yourself, why do we train them in the particular way that we do, with that learning rate, with this recipe, with this optimizer, with this kind of normalization? The answer is usually, well, because someone else did it and they got it to work. Usually, when it comes to a ResNet, Kaiming did it that way; when it comes to a Transformer, Ashish Vaswani did it that way. And so that's the reason why we train it that way. In many ways, the story of my career in machine learning is questioning those choices, and in the lottery ticket work, I questioned one very specific choice: why do we use all these weights? These networks are really big. We know they're so-called over-parameterized, but why? At that time in my career, I read all these papers on neural network pruning, this topic where you train the network and then delete connections that seem unnecessary, and you end up with a much smaller network that, as far as we can tell, performs about the same as the original network we started with. Why did we need those weights to begin with, then? Is there something intrinsically harder about learning than there is about representing what you've learned? Is it easier to know the rules of calculus than it is to learn and process them for the first time? Maybe our networks have to be big early on and can get small later, as they get smarter and have a more compact representation. That was what one of my professors at MIT told me when I asked him why we can't train smaller networks, and the lottery ticket ideas are one way that I found to make it possible to train smaller networks.
The trick is that any weight you were going to delete at the end of training, you never really needed it. You could have gotten rid of it at the beginning, or nearly at the beginning, but with one catch. When we create a neural network, we set each connection to a random value at the beginning. We have to initialize it to something, we don't know what yet, and the whole point of optimization is to get those weights to good values. But it turns out those random values aren't so random. Or rather, the specific sample we get from that random distribution is really important, and each weight in this smaller, sparse, pruned network needs to be set to the right value for it to be able to learn. What I found is that the values those weights are randomly assigned are actually really important for making those particular weights important. This subnetwork won the lottery: it happened to get a set of weights that allowed it to train well. If you sample a new set of initializations for it, it does really badly. And this initialization sensitivity is something that we don't typically see when we train traditional neural networks that aren't pruned. Ironically, it's something we now see all the time with these foundation models; the whole point of a foundation model is a good initialization. So I think these ideas have come back around again in a neat way. Lukas: So are you saying that really the point of having many more weights than you need is just that some of them randomly get assigned good initialization values? Is that what you're saying? Jonathan: It's a possibility. And the only reason I'll say it's a possibility is because I'm an empiricist. If I'm going to make a claim, I need to have an experiment to evaluate it and try to falsify it, and it's hard to figure out how to falsify that claim. If I wanted to do that, I'd really have to try every possible subnetwork to see how many lucky ones there are, and whether there are other subnetworks that got lucky that just didn't happen to emerge, or whether this one was kind of the one and only. That's my conjecture, but testing it is very difficult. Perhaps by a certain point in training, the network is optimizing in a lower-dimensional subspace, such that some of the weights just become unnecessary, and so learning can actually take place pretty successfully without those extraneous weights. Or at least whatever subspace it's in could be axis-aligned in such a way that you could prune a bunch of weights from the network. Again, testing that is exceedingly difficult. If you have a way to do that, please let me know; I'd love to have another dissertation chapter. But I think that is the high-level conjecture. In the original paper we called it the lottery ticket hypothesis; there's a hypothesis and a conjecture, and that is the statement of the conjecture. Lukas: And I guess one way to check it is just to go back to what they were originally set to, and see that it has the same quality of performance, right? Jonathan: So, when you go back to what they were originally set to and it has the same quality of performance, that at least indicates that the subnetwork is sufficient. But it doesn't necessarily mean that subnetwork is actually important to training when the whole network is there. It could be that we've found our way into some sufficient subnetwork that was actually doing nothing in the context of the whole network. So being able to say what that subnetwork is doing as part of a whole is a little bit more difficult.
It's entirely possible that the optimization picture looks completely different when you have the dense network that isn't pruned and the sparse network that is pruned, and we do know that that optimization behavior is pretty different. Lukas: So you can't just reset the weights to what they were when you started training, and then remove all the other weights, and get the same performance then, I guess? Jonathan: Oh, you definitely can. That's kind of the crux of the lottery ticket experiment: removing all the weights except those that you kept from pruning, and then just setting them back to their initializations and training them again. That does work quite well. But the question is, what purpose was that subnetwork serving within the context of the dense network? That subnetwork is good, it's able to learn on its own, but that doesn't necessarily mean it was useful in any way for the dense network. It's entirely possible that there are two completely different dynamics going on when you have the whole network versus the subnetwork, and I can't say for certain. That gets into tricky empirical, scientific questions that we don't really have an experiment for right now. Lukas: But you observe that the performance of the subnetwork is similar to the entire network, right? The dense network? Jonathan: Definitely, definitely. Lukas: I'm not quite sure what you're saying there. The subnetwork seems like it sort of is responsible for all the performance of the entire network then, right? Or... Jonathan: I think it's a necessary-versus-sufficient distinction. The subnetwork is certainly sufficient to get good performance, but it's unclear whether it's actually necessary. And one way to actually test this is to take that subnetwork and try...here, I'll pose it to you, I love these thought experiments. Take that subnetwork and, instead of keeping only the subnetwork, actually delete only the subnetwork and keep all the other weights. So you've got a subnetwork that's like one tenth of the size of your original network, and you've just wiped it out. What do you think is going to happen to that dense network when you try to train it, except missing that hole in the middle where the subnetwork should be? Is it going to do really badly? Lukas: I mean, I guess it's an empirical question, but I would sort of imagine that it would find another 10% to lean on. Is that right? Jonathan: That's exactly right. And so the claim that it's necessarily leaning on that 10% is something that we can conjecture about, but it's something we can't say for certain, because we don't have evidence to back it up. Because if we were to delete that, it'll lean on a different 10%. If the leaning is even happening, we could make that claim, but we need some hard evidence to show that it's even leaning on that 10% to begin with. That 10% happened to have the highest-magnitude weights at the end of training, but even the magnitude of weights doesn't necessarily confer importance. It's hard to say which weights are actually important and which weights aren't for the function, and using magnitude as a heuristic is a very naive one, at least. There are all sorts of fancier ones in the literature; they don't tend to work that much better, but people would argue that magnitude is a very naive thing to do.
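A minimal sketch of the "crux" experiment Jonathan describes: train, prune the smallest-magnitude weights, rewind the survivors to their original random values, and retrain with the mask held fixed. The tiny MLP, random data, and 20% keep ratio are placeholders, not the benchmarks from the paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy data and a tiny MLP stand in for the real tasks.
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
init_state = copy.deepcopy(model.state_dict())   # remember the random initialization

def train(net, mask=None, steps=200):
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(net(X), y).backward()
        if mask is not None:
            for name, p in net.named_parameters():
                if name in mask:
                    p.grad *= mask[name]         # keep pruned weights frozen at zero
        opt.step()
    return (net(X).argmax(1) == y).float().mean().item()

dense_acc = train(model)

# Prune: keep the largest-magnitude 20% of entries in each weight matrix.
masks = {}
for name, p in model.named_parameters():
    if p.dim() == 2:
        k = int(0.2 * p.numel())
        threshold = p.abs().flatten().kthvalue(p.numel() - k).values
        masks[name] = (p.abs() > threshold).float()

# Rewind: restore surviving weights to their original random values, zero the rest.
model.load_state_dict(init_state)
for name, p in model.named_parameters():
    if name in masks:
        p.data *= masks[name]

# Retrain the sparse "winning ticket" and compare.
print(f'dense: {dense_acc:.3f}  ticket: {train(model, mask=masks):.3f}')
```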
Lukas: Yeah, I guess that makes sense. Do you think that high rates of Dropout cause more of the network to get used? I would sort of imagine that if there's a lot of Dropout happening, it might force the network to use more of the weights available, at least to have a redundant mechanism. Jonathan: Maybe. So there are a couple of complications there. One is that Dropout typically works on the neurons rather than the weights, and so it may end up having a very different effect, potentially. And there does tend to be a huge difference between pruning weights and pruning neurons, in terms of how well you do and how much of the network you can prune. The networks seem to like having extra representational capacity, that is, extra neurons, but each neuron seems to not use that many different inputs. Hence why you can prune individual weights much more easily than pruning entire neurons, even if you're pruning, in effect, the same number of weights. The other piece here is that we have this intuition for what Dropout might be doing, but we don't necessarily have evidence to back up that that's what's happening. The original Dropout paper makes all sorts of claims that I would consider pretty outlandish and unsupported by any empirical evidence, and I like to only say what I have evidence to show. So it's hard to say that there's necessarily a relationship there, unless we can come up with an experiment to test whether Dropout is somehow making the network more robust to pruning, or something along those lines. Lukas: Well, that does seem like an empirical question, right? Whether Dropout helps the pruning. Is there sort of an inflection point in the pruning? Do you have a sense of, hey, you can prune up to X percent before there are problems, that sort of generally holds across networks, or across ranges of data, or anything like that? Jonathan: Not that holds in general, unfortunately. And one nice way to test this is that, even for the same network and the same training regime, you can play with the difficulty of the data in ways that make it harder to prune or easier to prune. As an example, training a network on just a standard image task, you can prune some percent of the network. Let's say for a ResNet-20 on CIFAR-10, something that anyone listening to this can probably train in a few minutes on a GPU at this point, that's about 90% of the network that can be pruned, or 85%, somewhere in there, before accuracy completely starts to drop off. If you prune all the weights, obviously things don't go very well, and there's some inflection point there. If you were to try doing this on an adversarially robust training task, which demands more of the network and is a little bit more capacity-intensive, one would imagine, you aren't able to prune as much, typically, before accuracy starts to drop off. The task, the way that you optimize the network, the final performance: they can all affect your ability to prune and how much you can prune. I wish there were a nice general rule of thumb. The answer is usually somewhere between 2x and 10x compression via pruning, although in some crazy cases, if people set it up right, you can prune 100x, or prune down to 100x smaller. Usually people set that up to make their pruning methods look really good in the literature, even if, at the end of the day, those are toy examples that are just meant to get gaudy-looking numbers, as opposed to really being scientific.
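A toy illustration of the weight-versus-neuron distinction Jonathan draws: pruning the same fraction of parameters either as scattered individual weights (unstructured) or as whole rows of a layer's weight matrix, i.e. entire neurons (structured). The layer size and 50% sparsity are arbitrary.

```python
import torch

torch.manual_seed(0)
W = torch.randn(8, 16)   # a toy layer: 8 neurons, 16 inputs each
sparsity = 0.5           # prune half the parameters either way

# Unstructured: drop the individual weights with the smallest magnitudes, anywhere in the matrix.
k = int(sparsity * W.numel())
cutoff = W.abs().flatten().kthvalue(k).values
unstructured = W * (W.abs() > cutoff)

# Structured: drop whole neurons (rows), keeping the half with the largest total magnitude.
row_scores = W.abs().sum(dim=1)
keep_rows = torch.zeros(W.shape[0], dtype=torch.bool)
keep_rows[row_scores.topk(W.shape[0] // 2).indices] = True
structured = W * keep_rows.unsqueeze(1)

# Same fraction of zeroed parameters, very different constraint on which ones.
print((unstructured == 0).float().mean(), (structured == 0).float().mean())
```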
Lukas: Is it even consistent across training runs? Does the pruning performance stay the same on the same data set with, say, different random initializations? Jonathan: It does tend to be pretty consistent across random initializations and random seeds. But then again, in some sense we've evolved the way that we train these networks to be consistent across initializations and seeds; we've spent 20 or 30 years trying to do exactly that. And to the point I mentioned earlier, about how these sparse networks are very picky about their initializations, I imagine that if we had made it the goal 30 years ago to have networks whose subnetworks aren't picky about initialization, we might have a completely different architecture and completely different optimizers. We have to remember that 30 years of grad student descent has landed us on these particular networks, with good properties that, in this case, we're exploiting. Lukas: And so I guess there are tons of different properties of a network that you could examine. Is there a practical application of pruning that gets you excited about it? Or what even caused you to look into this? Jonathan: I like it, honestly, for the scientific application. I'm really excited about the idea that we can understand how these neural networks learn more effectively. Right now, or at least when I started doing my research back in 2018, one thing that really struck me was just how utterly unscientific the literature was. It's littered with all these claims about flat minima, about the noise of stochastic gradient descent, about what Dropout does, about internal covariate shift, most famously with batch norm, which was just completely made up in the paper and never actually tested to see whether the effect was real before they posed their supposed remedy. That was just how the science was back at that time. And I feel like I sound old when I say this, but the science has gotten a lot better. Now those sorts of claims don't generally get into the literature without some evidence supporting them, and I like to hope that the lottery ticket work was part of that trend, that we do want to get a better scientific, empirical understanding, and it's not enough just to say things and not try to support them with facts, the way that a lot of the older, so-called great papers from around 2014 to 2017 do. But the other piece was, obviously, I was very jealous of the labs at MIT that had GPUs. My lab did not. And I thought, that's not fair; can't we make this more efficient, can't we get rid of some of those weights, won't that reduce the cost of training? Unfortunately, doing unstructured sparse pruning is generally very difficult to accelerate, because it's an irregular pattern and the hardware's not designed for it. There are certain specialized chips that can do unstructured sparsity pretty well, but they're not widely accessible, and the sparsity isn't generally applicable right now. For those listening who are working on those chips, feel free to let me know if I'm wrong, but that's certainly been my experience so far. So I would say this was a bit of a swing and a miss on that front. It was certainly effective on the scientific front; we've got all sorts of cool ideas that have come out of the lottery ticket work. But I think for me, Mosaic is really kind of the second at-bat, in some sense. It's an attempt to ask the same question, how do these networks really learn, empirically, and is everything we're doing necessary, are our recipes actually good, or are there better ways to train them out there? This time without the sparsity, which is hard to take advantage of, and instead with an eye toward anything that will actually speed things up and actually produce cost savings, immediately, today, on real hardware.
Jonathan: But for me, Mosaic is really the second at-bat, in some sense. It's an attempt to ask the same question: how do these networks really learn, empirically, and is everything we're doing necessary? Is our recipe actually good, or are there better ways to train them out there? This time without the sparsity, which is hard to take advantage of, and instead with an eye toward anything that will actually speed things up and produce cost savings immediately, today, on real hardware.

Lukas: It's funny, I was thinking maybe this is a good segue into Mosaic. When I think about Transformers and attention, that's another case like dropout where we have these evocative words, like "attention", where one wonders how real the hand-wavy explanations are, but we still generally use them. I'm curious how much you think Transformers are just a product of somebody doing something that worked well and everyone copying the details, versus some kind of fundamental insight. Do you think if you ran history back 100 times, you would get Transformers? What parts of Transformers do you think you would get in every case, and what's just a one-off of the random path we took to this architecture?

Jonathan: That's a great question, and it's hard to say. I do think there is something pretty fundamental to what we call self-attention. I don't know what it is that's so fundamental about it, that's very tough to say, but it does seem to work quite well, and we've had plenty of attempts to replace it that have had varying degrees of success, but nothing has supplanted it. Given that attention is so effective, and also so cheap relative to the massive feed-forward layers we use in our giant models, the really 10-billion-plus-parameter models of today, there's no reason not to use it. If it's effective, it's not really asking to be replaced, the way batch norm is asking to be replaced. If anyone has a batch norm replacement, please let me know; I want to get rid of it very badly. It's such a simple architecture that I think we would have arrived there eventually. At the end of the day, self-attention is really the most powerful new component; otherwise it's just a feed-forward network. Self-attention was already bouncing around the literature in various ways, and the folks who wrote the paper really put the pieces together exceedingly nicely. These good inductive biases are hard to come by. I have a bet going with Sasha Rush right now that Transformers will still be the go-to way to train NLP models in another five years or so, and that bet is placed because convolutions have lasted a really long time in vision. The vision Transformer is still something I almost never get asked for at Mosaic; it's an academic curiosity by and large, and the ResNet is still the real workhorse. Convolutional networks have stood the test of time, recurrent networks stood the test of time, and I think Transformers will as well. These little inductive-bias insights are pretty hard to come by, but when you take a step back, they're relatively simple tricks at the end of the day.
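Since the conversation keeps returning to how simple the attention block really is, here is a minimal sketch of a single head of scaled dot-product self-attention in PyTorch. It is only an illustration of the basic structure; the architecture in the paper adds multiple heads, masking, dropout, and projection layers.

import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Single-head self-attention, for illustration only.
    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                                # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Each position scores every other position, then mixes their values.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = scores.softmax(dim=-1)                 # each row sums to 1
        return weights @ v

x = torch.randn(2, 16, 64)
print(SelfAttention(64)(x).shape)                        # torch.Size([2, 16, 64])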
Lukas: What then matters for speeding up training at this point, and what are the things you're working on at Mosaic around that?

Jonathan: Everything matters. I wish I could tell you that when we get a 7x speedup on a model, like we did with ResNet-50 on ImageNet, or a 3x or 4x speedup on BERT pre-training, which will probably be announced by the time this podcast is out, or the speedups we'll have coming on GPT-3, which I don't know yet but I'm sure will be out by the time this podcast is out, that there's one neat trick you need to do to get that speedup. The answer is that the ResNet-50 recipe, that 7x, is 17 different interventions in training, affecting everything from data augmentation to the inductive bias of the network, the optimizer, regularization, and the loss function. It's basically anything and everything there is in the network, even shaping how things go over the course of training. I wish it were one thing, but as with all good systems optimization, it's five percent here, five percent there, and once you stack enough of that up, you get to something really impressive-sounding like 7x.

Lukas: I guess the challenge, with maybe all neural net research, is that each experiment is expensive, and these things don't typically, in my experience, add up linearly. How do you even know what's contributing to your speedups?

Jonathan: This is why we have a research team. This is the hard part of our jobs: trying to piece together what may work together with what. People often ask me whether the secret sauce of Mosaic is the speedup methods, and the answer is no; we put those out, open source, for free. The secret sauce of Mosaic is the research team that has developed the methodologies, the intuitions, and the ways of thinking about the problem. It's an emerging science, this science of composition, and we don't necessarily have a good recipe for it. I wish there were some automated system that would do it for us so I could tell all the researchers to go to the beach, but a lot of it comes down to principles we're developing. For example, in the early part of training, nothing that important or interesting tends to get learned, so we can generally get away with, say, decreasing the resolution of images, truncating sentences, or playing with the masking ratio for BERT. Another principle is that you only have a certain budget of regularization for a given training-run length, so you need to spend it wisely on things that won't slow down training; some regularization methods cost nothing and some are actually pretty meaningful slowdowns, and you have to choose wisely on that front. There are ideas around balancing which parts of training you speed up: if you make backprop faster, you've got to make forward prop faster as well, otherwise you start hitting diminishing returns on anything else that makes backprop faster. But there is a lot of art to this as well. How do you make a recipe that's good enough for this model, but not so overly specific to one data set that it won't work if somebody comes along with a new data set they want to try on this model? There's a balance between how specific and fast the recipe is and how general, and perhaps slower, it is. These are all, in some sense, subjective trade-offs that we have to make. It is a little artisanal at this point.

Lukas: Do you think it'll stay artisanal?

Jonathan: I think it'll stay pretty artisanal. In some ways that's good for business: if it's not artisanal enough, my research team needs to find other stuff to work on.
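As one concrete illustration of the "nothing important gets learned early" principle mentioned above, here is a minimal sketch of progressive image resizing: train at reduced resolution early and ramp up to full resolution over the run. The model, loader, and loss_fn are placeholders, and this is a generic illustration of the idea, not Composer's actual implementation.

import torch
import torch.nn.functional as F

def resolution_for_step(step, total_steps, full_size=224, start_scale=0.5):
    # Linearly ramp the image size from start_scale * full_size up to full_size.
    scale = start_scale + (1.0 - start_scale) * min(step / total_steps, 1.0)
    return max(32, int(round(full_size * scale / 32)) * 32)    # keep sizes GPU-friendly

def train_epoch(model, loader, optimizer, loss_fn, step, total_steps):
    for images, labels in loader:
        size = resolution_for_step(step, total_steps)
        if size != images.shape[-1]:
            images = F.interpolate(images, size=size, mode='bilinear',
                                   align_corners=False)
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        step += 1
    return step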
Jonathan: But it is artisanal, insofar as every model is different, the way we train it is different, and each of these interventions is different and has its own weird effect. I think of lottery tickets as just one of these interventions among dozens that we've tried at Mosaic. The way I've spent five years getting to know lottery tickets and all the ins and outs of how sparse networks behave, we have to get to know each of these interventions, how it behaves, and what effects it has. That's a long journey. Then, seeing it applied to a new model, sometimes our ideas translate over and sometimes they don't, and understanding why or why not is new knowledge we can use to build on and understand these methods better. They almost feel like friends in a lot of ways. They have complicated personalities, and understanding how they work together is tricky, frustrating at times, and makes our researchers want to pull their hair out. There is some intuition and some high-level rules of thumb that we've started to use, but I don't think we'll be automating this anytime soon. It's like neural architecture search or hyperparameter search, but even more difficult, because now we're adding different changes to training whose effects are difficult to predict until you've really trained the model to the end.

Lukas: Before we go too far down this path, we should probably talk about MosaicML. What's the story behind it? What do you offer to the world?

Jonathan: I'll try out a new way of describing it, and you can let me know if it's any good. I like to try these things.

Lukas: Love it.

Jonathan: In the hardware world we have these foundries. We have a company like TSMC, the Taiwan Semiconductor Manufacturing Company. They're all sorts of geopolitically interesting right now because they are the best place to get your semiconductors made: they have the most advanced process technology and the smallest transistors, which means the best power efficiency and the best performance of anyone. There are a few other companies in the world that do this. There's Samsung, there's GlobalFoundries, which used to be part of AMD, and Intel has its own internal foundries. But at the end of the day, if you're NVIDIA, AMD, Apple, anybody, you go to one of these foundries and say, ""Hey, there's a chip I'd like to get made."" TSMC gives you some high-level APIs, some high-level abstractions you can use to design your chip for them, then you hand it to them and say, ""Print me the chip."" They're not experts in designing chips; TSMC doesn't make its own CPU. They're really, really good at taking your designs and bringing them to fruition. They have all the latest technology, they have the most efficient processes, and they know how to improve yields. I think of us at Mosaic as being TSMC.

Lukas: Interesting. Can you talk about one customer and what their specification looked like?

Jonathan: Definitely. We're not in the business of making a GPT-3 clone. We're not training our own language model; we're never going to put out an API and say, ""This is the Mosaic GPT, come here instead of OpenAI,"" or something like that. We're happy to take all comers and say, come print your model. We have the latest technology for doing it efficiently.
We know how to improve yield as much as possible, by which I mean we know how to make these training runs go well the first time. I can stretch this metaphor quite far, and we'll see how far it goes. But we've worked with several customers now who want to train large language models that have specific properties, or where they have a specific data set they want to use, or who for compliance reasons can't use GPT-3 internally. You can imagine lots of enterprise customers are pretty concerned about what might be in that model, and the training data isn't public. And everybody has lots of data. That's one thing that's struck me: companies don't realize they have lots of data, but everyone has tons of unlabeled data. Someone came to me, and they were using models pre-trained on ImageNet to do some vision stuff, and we asked, do you have any unlabeled data? They said, yeah, we've got a little. How much? Nine petabytes. I was like, and you're pre-training on ImageNet? Are you kidding me? Someone else was using BERT pre-trained on WikiText, and we asked, do you have any unlabeled data in your domain? Yeah, a little. How much? 300 billion tokens, enough to train an exceedingly large language model. Why are you using WikiText? So the data is out there, and people want these large language models, often with some specific properties or with something tweaked about the model. That's what we're here for: to give you a great process technology and the ability to customize it to what you'd like. So I like this foundry metaphor quite a bit. I think it distinguishes very well what we do and what we don't do. We're not here to put the chips in computers and run them, meaning we're not here to do inference. We're not here to design your model for you; we're not a consulting company. We're here to help you build the best darn model you want, for the lowest amount of money, and to get something that works really efficiently.

Lukas: When you say that you won't build the model for me, it kind of sounds like you are building the model for me. Where does the line fall?

Jonathan: I should specify what building the model means. I'm not going to tell you which model you should use, that you should use a ResNet for this and a GPT for that. I'm not going to tell you how to set up your optimization problem, and I'm not going to go through and help you curate your data or things like that. My focus is: you're ready to train, so let's train the best darn model we can, and let's get it right the first time. I'm not here to solve your machine learning problem or to set up your machine learning problem for you. I'm here to help you train the model that you'll eventually use in production, once you've figured out how you want to solve that problem and, strategically, how you want to go about it.

Lukas: Interesting. One challenge, putting myself in your shoes, might be that it's kind of hard to know a priori, at least in my experience, how well training is going to go. How do you work that out with a prospective customer?

Jonathan: It's tricky. I think we're learning how to hold people's hands through that process a little bit, in the same way that I'm sure TSMC
doesn't just say, ""Here are the tools, let us know when you want to print a billion of these chips."" You go through and do sampling, and you simulate the chips before you build them. We have a similar process. Before you train that 100-billion-parameter model, we should probably train a 1-billion-parameter one and make sure that things work end to end and you get a result that looks reasonable. Did you use the right tokenizer? If we train a 1-billion and then a 3-billion, are the results getting better? Is your data quality high enough? Can we go back through and make sure all the inputs are looking good? We've seen everything that could possibly go wrong at this point, and there's so much that goes wrong when you train these models. The folks at Facebook were really kind and put out their logbook of all the stuff that happened when they were training the OPT model, the 175-billion-parameter one. Everything you can imagine went wrong. Hardware dies mid training run. You get loss spikes. The resumption times are really long if you're not careful: there are multi-terabyte checkpoints you have to load, and even getting the data loader spun back to the point you were at in training can take a very long time. And just some weird stuff happens. On a training run this long, you'll get memory errors; that one-in-a-billion cosmic ray striking your memory will happen. You get things like, ""Oops, I used a different tokenizer for training than I did for evaluation, so all my evaluation results look really bad even though my model is really great."" So we try to get all that ironed out at the small scale, and then we'll work with you to go bigger and bigger and bigger until we're ready to build the giant chip, as it were, and really, actually train the model.

Lukas: In my experience, people typically don't just want to train a model one time; they want to continuously update it forever. Do you then take over that process for a customer? How does that work?

Jonathan: We're certainly in the loop on that front. We have APIs and a Python software development kit, so if you want to do data intake and just schedule retraining to take place, you can do that on our platform. We've got the platform, it's very easy to program around it for simple automation like that, and we have tools to help you do it. And I think you're right. A lot of people say to us, ""Aren't you going to have a bad business? A customer is going to come to you once, train that model, and then leave."" The metaphor our CEO Naveen likes to use is: if you're building a piece of software and you get to version 1.0, do you fire all your engineers and say, ""Oh, I'm done, software's done""? No, of course not. You've got more features you want to build, things you want to update; nothing is ever really done. I think that's good for our business, but also, once you've done that big training run, we're developing ways to make sure that your second training run is much cheaper, based on taking advantage of aspects of your big training run. That's a place where we're investing pretty deeply in the technology, so that each incremental run should be cheaper than the last one. Think of it almost like a frequent flyer program: the more you train, the more you save, in some sense.
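A recurring theme in this answer is how much can go wrong over a long run: hardware dying mid-run, loss spikes, and slow resumption from huge checkpoints. Below is a minimal sketch of the resumable-training pattern that guards against this. The path, save interval, and train_step() are placeholders, and this is generic PyTorch rather than Mosaic's platform; real large-model runs also shard checkpoints and save data-loader state so resumption is fast.

import os
import torch

CKPT = 'checkpoints/latest.pt'   # hypothetical path

def save_checkpoint(model, optimizer, step):
    os.makedirs(os.path.dirname(CKPT), exist_ok=True)
    torch.save({'model': model.state_dict(),
                'optimizer': optimizer.state_dict(),
                'step': step}, CKPT)

def load_checkpoint(model, optimizer):
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location='cpu')
    model.load_state_dict(state['model'])
    optimizer.load_state_dict(state['optimizer'])
    return state['step']

def train(model, optimizer, batches, total_steps, save_every=1000):
    step = load_checkpoint(model, optimizer)    # resume if a checkpoint exists
    for batch in batches:
        if step >= total_steps:
            break
        train_step(model, optimizer, batch)     # placeholder: forward/backward/update
        step += 1
        if step % save_every == 0:
            save_checkpoint(model, optimizer, step)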
Jonathan: But there's a lot of really interesting science behind how to do that without your first model determining how all of your other models are going to go, because your data may change a lot.

Lukas: How do you think about engaging with the research community? Obviously you're still publishing papers, but you also talked about your secret sauce. Is there a bright line in your mind about what you publish and what you keep to yourself?

Jonathan: Definitely. The first thing I'll say is that we don't publish. That is one line I did draw for the team early on. We're not Google Brain; we're not here to be an open-ended research team. We have a job to do and customers to serve, and we do science in service of that. For anyone who's looking for an interesting job: don't come here if you want to write papers or do open-ended research, because that's not what we do. But we do like to share our results, and we'll talk about everything. We have to talk about our speedup methods. If we didn't, imagine you came to me and said, ""Hey, I want to train this model,"" and I said, ""Well, we're not going to train that model, we'll train something slightly different, but I won't tell you what, it's a secret."" You wouldn't really trust me. We have to be open about that, algorithmically. So the secret sauce is really a couple of things. One is the expertise we've built as a team to really attack these models and speed them up; the secret sauce is, in some sense, experience and wisdom, and the culture and scientific practices we have on the team. The way we make money is that we put all our speedups out there, but our cloud platform, our orchestration software, the tools that make it really easy to train these giant models, the managed version of all this, is what you have to pay for. When you're training a large language model, good luck doing it without that. You're going to have to stay up 24/7 watching the loss for the spike and then figure out how to roll back and restart, and a lot of those tools are part of our paid offering.

Lukas: So you do publish your algorithms then, am I understanding that right?

Jonathan: Sorry, let me clarify the word ""publish,"" because I think we're using it differently. We don't submit papers to conferences for publication in that way, but we certainly do share openly what our algorithms and recipes are. That's all available in blog posts and in our open-source Composer library, freely available for anyone to see and use. But publishing in the academic sense honestly just takes too long, and we can disseminate our results without having to go through peer review and all that good stuff.

Lukas: I see. Another topic I wanted to make sure I hit with you: it seems like you're a bit of a skeptic of this current approach leading to AGI, and you seem maybe quite sure about that point of view. I wonder if you'd say more about how you came to that, or whether that's a fair characterization of your perspective.

Jonathan: I think it's a very fair characterization of my perspective. First of all, getting a good definition for AGI is pretty tough. Either it's kind of everything or nothing. Either it's something pie-in-the-sky that we'll never
really reach, true human intelligence, in which case trying to get that out of a feed-forward neural network seems like, to use a metaphor I've heard a lot lately, building a ladder to get to the moon. That's not how we're going to get there; it's going to take something fundamentally different. Or, if you have a very narrow definition of what AGI is, it's possible ChatGPT is AGI, and I've heard some people arguing that. I think AGI is really an all-or-nothing term, and I'm more on the ""all"" sort of definition people talk about: truly general, human-like intelligence and the ability to learn and adapt in an environment. In which case, a feed-forward neural network is not going to get us there; that's just fundamentally not the right technology for it. I really think AGI is being used pretty cynically by a lot of people in the field as a way to get people to give them money: either because they claim they're going to make something happen, or because they claim they're going to study something catastrophic that would happen. Either way, I take it as cynical, in pursuit of resources and power and money, not something that people mean very seriously, at least other than the extent to which they're misleading others.

Lukas: Interesting. What are some things, some reasoning tasks maybe, that you think a feed-forward network surely wouldn't be able to do?

Jonathan: The average feed-forward network today, even ChatGPT, is probably just looking at a very long context. If that context is essentially our memory space and our state space, and the model is able to just write back to that context and reuse it for future token prediction, that's a pretty raw way of giving the model the ability to interact with itself and interact with an environment. So it's hard to point to a task. People like to quiz me on this: what is the SAT score at which you'd consider this to be AGI? I had someone really badger me about that when I gave a talk recently where I expressed skepticism about AGI. It's hard to pin it on a task and say the model can't do this right now, and if it does that, that's AGI. Look at the Turing test. We've been passing the Turing test for 40 or 50 years, and it's been a pretty awful test of whether something like ELIZA had AGI. So I don't like to point to one task and say this is the thing that something must do in order to be AGI. But I don't think the setup of a feed-forward network, where we're just adding tokens to a context and hoping that it's able to take advantage of all those tokens, is in any way going to lead to some kind of general intelligence.

Lukas: Honestly, I don't think I have a super strong point of view, but you seem very empirical, and it seems like a very strong claim to say that surely this approach can't do this. Maybe you're saying that it's so poorly defined that it's not a meaningful claim?

Jonathan: I guess I think it's a meaningless claim, but I also think that for many definitions of AGI, feed-forward neural networks are not really
going to be able to pull it off.

Lukas: So I guess I'm just trying to get at, and this is not a gotcha question, what are the kinds of things that you think feed-forward networks will never be able to do?

Jonathan: Right now we're watching ChatGPT really struggle with long context lengths, where someone goes back and forth with it for enough iterations that it starts cycling, or it clearly loses track of what was happening earlier on. We still don't even know how to solve that basic problem, and it's a problem we're going to have to overcome. Handling large amounts of information at scale, being able to somehow reason about it hierarchically or something like that: we're still nowhere close, and now that people are finding some of the soft spots of ChatGPT, we're seeing that happen in real time. That's a basic problem we're going to have to overcome. These models still attend mostly to the tokens that are closest to the current token; they don't really attend that far back, and these things end up reasoning in loops because of that. If we want things to reason, this seems like a pretty inefficient way to get something to reason, in and of itself. So I'm pretty skeptical that just taking the same things and making them bigger will solve any of these problems, and those are basic problems we're going to have to overcome before we get there.

Lukas: You're unusual in that you have, I think, a really strong interest in policy. Could you tell us a little about that, and what you think is important at this moment? I guess it's December 2022. What kinds of things are you advocating for?

Jonathan: I'm curious, so I'm going to turn the question on you for a moment: how would you define policy? I love to do this to people because you always get interesting answers.

Lukas: How would I define policy? I guess my first thought is government regulation of what companies can and can't do. And then I think there's another thread of, maybe outside of regulation, what companies should do to make sure the work they do has a positive impact on the world. What am I missing?

Jonathan: Your first one, regulation, I would consider law but not policy; it's an instantiation of policy. The big distinction is the question you got to in your second point: what should we do? What is the ""ought""? What should we be accomplishing, and what do we want the world to look like? In some sense the rest is implementation details; that's when you get to the concreteness of law. But even policy at a high level can be simple questions of what we should do, or what direction we want the world to move in. From a policy perspective, I don't see policy as necessarily advocacy. Advocacy is one thing you can do: you can advocate for what we should do. The other is simply providing consultation to the people who do make policy, to parliamentarians from around the world, or whoever the people setting policy are, who are trying to figure out what direction they want their countries, and the world, to move in. That tends to be the role that I take. I tend to be a technical expert that
gets called in to help provide context to policymakers on topics, in this case related to machine learning, but in the past it was privacy or security. I spent a year at Georgetown Law as the technologist-in-residence, helping them make better decisions about what kinds of policies they recommended, what kinds of research they did, and how they understood their findings on various topics, specifically, in that case, police use of facial recognition in the US. We did a big study showing that, I think, at that time one third of all American adults were in a police facial recognition database. At that time that was earth-shattering news; today I think we all understand we're probably in Clearview and a bunch of other things, and we've given in to a surveillance state in a way that we hadn't before. Today I spend a lot of time with an organization called the OECD, which is a UN-style organization that does economic policy for mostly the democratic, capitalist countries and helps do research on things like national AI capacity and how countries should be building it. So it's less about advocacy. The important distinction I'd make is that I'm one input into this process. I'm a source of consultation, a source of expertise, and a source of detailed knowledge about how AI does and doesn't work. I can provide feedback and make recommendations about what the right implementation would look like when someone has a policy goal. I do see a lot of us in computer science expressing the hubris that we should be the policymakers, that we should set the final policy, that we're not just one input into the process, that we know better than the lawyers who've been thinking about questions of, say, fairness and bias for decades or centuries, however long it is. With a lot of our questions around alignment or safety, did we ever realize there are regulatory agencies that have been dealing with, say, automobile safety for a very long time, and probably have some good ideas about how to structure constraints and what we would think of as a safe car? In computer science we tend to have the hubris to think that we can reinvent the wheel better than other people; we like to disrupt things. On a lot of these topics I think that leads us wrong, and perhaps the right way is to engage with the people who have built up expertise specifically in taking on these kinds of ambiguous questions that don't have clear answers. We should be consultants, but we're not the only input into that process, and we should trust people who have legitimately studied this, not people who have made up new definitions of fairness because they thought it was interesting.

Lukas: Maybe I'll ask the question in a different way. You have front-row seats to the explosion of use cases around language and vision models. What concerns you the most about where things are headed?

Jonathan: This is not a novel concern, but I think we're getting to a place, and you're seeing this even with all the things I've seen on Twitter about ChatGPT, where these models are very confident even when they're full of crap. These models sound very convincing even when they're speaking complete nonsense, and we
don't have a way to tell the difference right now. We've seen this danger many times in the past in other forms: information from a source that seems reliable, that feels reliable, doesn't necessarily have to be true. Let me put it a different way. A lot of our training as people, in how to tell the difference between what is and isn't true and what should and shouldn't be trusted, is being exploited by some of these models to convince us that things are true that aren't, or to make things seem real that aren't. We're not mentally prepared for a model that sounds really confident and speaks really intelligently but is just BSing, because it's a language model trained on Wikipedia and Reddit, or for pictures coming out of something like a diffusion model that seem real but aren't. The world is moving much faster than our cognitive biases are. I think we'll adapt, in the same way that people adapted to yellow journalism back at the turn of the 1800s to the 1900s. We've adapted reasonably well to fake news as people; we're now pretty skeptical of what we read online, even if it looks like it comes from a publication of some kind. We'll adapt here too, but things are moving so fast that it's hard for anyone to keep up. I don't think we're ready. I don't think our cognitive biases are quite ready for the onslaught that came last year, let alone the one that's coming this year, let alone the one that will come next year.

Lukas: As someone building a foundry for making lots of these models, are there things you feel obligated to do on your side to help with these issues? Or is it really about training the consumers of these models not to just trust confidence, which might be useful for people anyway?

Jonathan: No, we're definitely obligated, and there are a lot of different ways of addressing that. Personally, I find a lot of impact in being downstream of these problems: if I'm going to make messes, I have to clean them up. So in some sense my policy work is an attempt to make sure that, as I'm on the bleeding edge of creating this technology, I'm also providing that same insight to policymakers, so they can adapt to this as quickly as possible and so we're asking the right things of people as these models change. Part of it is also that we need to be responsible about who we work with. There are some companies or organizations that, at the end of the day, we may choose not to work with if we don't think they're mature enough to handle this technology properly. Part of it means we need to move further down the chain, beyond just how you build this model. I'm trying to think of the right metaphor from the chip world, but it's probably something like helping a company pen-test their processor: I'm not going to tell you how to build it, but I do want to provide you with a toolkit to make sure you've built it such that it's robust to X, Y, and Z, such that you don't have
timing-channel attacks. So we may need to move further down the chain and help people evaluate their models effectively. That is something, though, that a lot of fantastic organizations out there are already working on, and far be it from us to reinvent the wheel. Part of it is having great partners, like Weights & Biases, who we work with extensively, to make sure that we can offer customers a full solution. No one company is going to solve all their problems, but there are fantastic companies out there looking at questions of bias who are probably going to adapt as these models fool us, or get around our cognitive biases, in increasingly sophisticated ways. I want to be able to show up to a customer saying, ""Hi, we're Mosaic, we train models, but here's our preferred partner who we work with closely, who's an expert in helping you evaluate and test this model before you put it out in the real world, and we highly recommend you work with them,"" in the same way that today we say, ""Here's our preferred partner for experiment tracking, and we highly recommend you work with them because they're great."" We don't want to solve everyone's problems; it's about putting all these pieces together into one great solution. At some point, if we get big enough, we could certainly convene an advisory board of some kind. That's something I've seen work reasonably well from my perspective as a policy person in that world, and there are certainly a lot of friends, you know who you are, who I'll be calling on for a favor to help us make good decisions: someone on the outside who has the trust of the community, and my trust, to help us make good decisions on that front. One of my friends did liken what we do to building cyber weapons for people. These models can certainly be used in that way, and we do have a responsibility to help make sure that these models are being used carefully and to keep an eye on what our customers want to do with them.

Lukas: You have really front-row seats into applications of these models. What's your perspective on what sort of new use cases are opening up with this technology?

Jonathan: Everything. I'll give you a worse answer than just browsing your Twitter feed right now, to some extent. I'm watching my research group back at MIT try to make ChatGPT write programs. I guess I won't scoop them, because this podcast won't be out for a little while, but they have a Slack channel where they're playing around with this and looking at the strengths and weaknesses of the model, and it's really impressive, even if it's deeply flawed. It's so impressive that we've gotten to the point where the flaws we can point out are things like, ""Well, it doesn't seem to remember facts that well."" That's a huge win over where we were a couple of years ago, and we need to celebrate these wins even if these things aren't perfect. Or folks who are using diffusion models for all sorts of really creative things, and I don't want to scoop them either, but things that go beyond artistic purposes, toward using them to generate new products or new ideas or new designs. Really, I'm not the domain expert here, and I think the beautiful thing about this is that the barriers to entry are
low enough that any domain expert can see how this tool can help solve their problems. I had someone reach out recently about using this for some sports-related purposes that I thought were really cool. I wouldn't have thought of it, but this person happens to work in the sports industry and had an idea for how to use a large language model there. It may not be quite the right thing, or it may take some more machine learning work, but it's at a point where this is mainstream enough that I'm not the one to tell you the cool applications. Go look at the world, at all the brilliant creative professionals out there and the people in their own industries trying to solve problems; they're the ones who should do it. In the service of not having too much hubris, the same way I try to be humble around policy: I'm the foundry. TSMC is not the one thinking of innovative hardware solutions. They're not Cerebras building a wafer-scale engine; they're not Graphcore thinking of a whole new way to organize a chip. They're just saying, ""Oh, that was brilliant, cool use of our technology, let's help you make it.""

Lukas: But if I'm coming to Mosaic, I'm probably doing more than just fine-tuning an open-source model, so I must have something I really, really care about. Who comes to Mosaic to do this, and why are they doing it? I understand why you might not want to use GPT-3, which has to be hosted in the cloud, where you can't hold the model, but there are open language models out there right now. What causes a company to undertake the very big expense of building one of their own foundation models?

Jonathan: Data. It's specifically data, in one word. By that I mean that your data is your identity. That's true from a personal perspective, but also from a company perspective. Every company is sitting on so much unlabeled data, and we now have incredible methods to leverage that unlabeled data, for images and text, and probably soon combining the two and all sorts of other modalities. That data is your identity. Would you rather use the same identity as everybody else? Would you rather use Reddit, which is probably a pretty large part of what's in GPT-3, or would you rather use your data and your identity? That's really what we're going after.

Lukas: That makes sense, but then what are people doing downstream of it that's so important that they're willing to make this huge investment?

Jonathan: Anything and everything, from actually using this for customer service scenarios, to interesting open-ended classification tasks, to the kinds of few-shot or zero-shot tasks where you can prompt in a domain-specific way. All the standard applications of a GPT model, applied to whatever their scenario is inside their company. It's the same applications, but with the benefit of having your data imbued into the model.
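As a hypothetical sketch of what ""imbuing a model with your data"" can look like at small scale, here is continued masked-language-model pretraining of an off-the-shelf model on a private text corpus, using the Hugging Face transformers library. The model name, file path, and hyperparameters are placeholders, and this is an illustration of the general idea, not Mosaic's platform or any specific customer's setup.

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForMaskedLM.from_pretrained('bert-base-uncased')

# 'my_domain_corpus.txt' stands in for whatever unlabeled text the company is sitting on.
raw = load_dataset('text', data_files={'train': 'my_domain_corpus.txt'})
tokenized = raw['train'].map(
    lambda b: tokenizer(b['text'], truncation=True, max_length=128),
    batched=True, remove_columns=['text'])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='domain-bert',
                           per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()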
Jonathan: One thing that we always think about, one paradigm I'm thinking about a lot these days, is the idea that these large language models are really databases. They know things. You can query ChatGPT and get all sorts of really interesting facts out of it. A lot of those facts aren't quite right; there was a beautiful thread earlier today about asking it what the fastest marine mammal is and getting back a falcon. But it has knowledge, and it has facts. There's a great bit of work from Ofir Press, a researcher at the University of Washington. He's a PhD student right now, and he's on the job market, by the way; I have to say that for anybody I know who's on the job market. He does this thing he calls self-ask, where he gets the model to reason through a task by asking repeated follow-up questions. Then he did this really cool thing where he Googled each of those follow-up questions, took whatever answer Google gave in its knowledge box, and gave that back to the model. So having these models interact with databases, or having these models be the databases themselves, and probably some combination of the two. I kind of think of every relational database out there as an exact version of a database: you have a schema, you have a way of querying data. These large language models are, in some sense, soft databases. You can ask them natural language questions, and they can find relationships between data that aren't expressed in a relational database but might be expressed if you give them enough data and teach them what text, or language, looks like. You can query these things like databases, or even connect them to databases as well. For me, that's an emerging application area: thinking of these not as models but as soft databases that give you the ability to make connections you could never make if you only had an exact relational database. They're kind of fuzzy databases, or approximate databases; I'm sure someone will coin a much cleverer term than that, but that's how I think of them today. And when you think of it from that perspective, do you want your database to be whatever web crawl OpenAI used, or do you want your database to be your data? The answer is probably a combination of both.

Lukas: Probably both, I was going to say.

Jonathan: Yeah, but you certainly want your data in there, and if you have a lot of data, whether you're pre-training from scratch or starting from OPT or some other pre-trained starting point, you still need to imbue this model with your data.
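Here is a rough sketch of the self-ask-with-search pattern described above: the model decomposes a question into follow-up questions, each follow-up is answered by a search engine, and the answers are fed back into the prompt. The llm() and search() functions and the stop parameter are hypothetical placeholders, not Ofir Press's actual code or any particular API; the marker strings follow the general style of the self-ask prompt.

def self_ask(question, llm, search, max_follow_ups=5):
    prompt = f'Question: {question}\nAre follow up questions needed here: Yes.\n'
    for _ in range(max_follow_ups):
        step = llm(prompt, stop=['Intermediate answer:'])   # model proposes its next step
        prompt += step
        if 'So the final answer is:' in step:
            return step.split('So the final answer is:')[-1].strip()
        if 'Follow up:' in step:
            follow_up = step.split('Follow up:')[-1].strip()
            answer = search(follow_up)                       # e.g. a search engine's answer box
            prompt += f'Intermediate answer: {answer}\n'
    return llm(prompt + 'So the final answer is:')           # force an answer if we run long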
Lukas: How many companies do you think will try to build these large models from scratch?

Jonathan: Hundreds at least, possibly thousands. At least our business so far seems to reflect that possibility; business is really good for training large language models. We feel a little bit like TSMC right now, in that getting capacity at TSMC is really hard and you've got to be an Apple-sized customer to be able to book that capacity a long way in advance. I feel that way right now in terms of how we're booking our large language model training at the moment, and my team, if you manage to find them right now, will certainly tell you that as well. But the answer is everybody. Everybody is sitting on so much data. Many companies are going to need lots of these things, and lots of companies are going to need at least one of them. We're seeing this all the way down to relatively small companies that want to do some fine-tuning of these models because they have some business-specific data that is really important for them to be able to use these models effectively. I genuinely think, what is it, ChatGPT got up to a million users after a week of use? I think that should tell you something about the number of people who have found interesting use cases, or are at least very curious about where this technology might fit in. And if that's a company, they have a lot of data they can use to customize this model for themselves. At the end of the day, if we've learned one thing from these large language models, it's that it's all about the data. The models aren't very interesting; they're Transformers. It's a cool technology, but Transformers are pretty simple compared to an LSTM or something like that. It's not even about the way we train it, other than the fact that the way we train it is really expensive right now and Mosaic needs to get that cheaper. The data is where the magic happens. The data is what gives this model its superpowers, and that's the place where everybody's going to want to customize.

Lukas: It's interesting you say that as someone who does the model building. Do you offer any kind of services, like active learning, or ways to improve the data? If you feel that the data is the most important input into the model, how do you engage with the data?

Jonathan: Right now, we have a lot of excellent partners who are fantastic at this. We're a startup, we need to stay focused, and our focus is on making it cheap enough that you can even contemplate doing this. The data is a problem you only think about when you can actually train the model. If I told you the model was 10 million dollars to train, in many cases you wouldn't care about data quality at that point; you'd just know that you're not going to be able to afford to train it. If I tell you the model is a hundred thousand dollars to train, then there's another conversation to be had, and I think for the first time we're even having that conversation, in a way that I don't think we could have prior to a lot of the efficiency work that's happened at Mosaic and elsewhere. Now that we're in this place, it's something we're definitely thinking about and building tools to provide. But there are a lot of fantastic companies and partners out there that are experts in data, and far be it from us to come in and think we know better than them.

Lukas: Can you tell me some of your favorites? I'm curious, and I'm neutral in this, because I used to do a data collection company but I haven't in years.

Jonathan: I don't want to play favorites in public or anything like that. There are a lot of great folks out there, and we've worked with a lot of different folks in the past and will probably work with a lot of folks in the future, so I don't want to play favorites here. But it's a really competitive space, which means there are a lot of really smart people working hard to do better at data curation and data labeling, and I would trust them over me right now, for sure. I haven't been doing this for years the way many of these companies have, and you should go to the experts. We're really, really good at training, and we do that really well. Depending on what kinds of data someone's working on and what kinds of companies they're looking to work with, we can point them in a number of different directions. It's fun being a startup. It's like being in academia in some sense: nobody's an
expert in everything, and if you want to accomplish anything significant, you have to collaborate. I like to look at the world as being full of awesome collaborators who we can work with.

Lukas: All right, well, we always finish with two questions, and I want to make sure I have time to get them in. One of them is something I'm sure you'll have a good answer for: outside of what you're doing now, what's another research area you wish you had time to look into? What do you think is an underappreciated research topic in machine learning?

Jonathan: Oh man, there are so many. I would tell everyone who's working on adversarial examples or federated learning, or anything else that's kind of academic and not very exciting, to go work on any one of these instead. For me, I'm really excited about the data questions: understanding how much data quality really matters, and understanding how much things like reinforcement learning from human feedback, or just instruction fine-tuning, actually matter. Are these red herrings? OpenAI said they were important; why should we believe them? Why should we believe anyone until we've reproduced it and seen whether it's actually valuable? There's a great opportunity for those of us in academia, and in my other life I will be an academic again next fall, and technically I guess I still am for another three days until I defend. For those of us in the academic world, these are fantastic questions to ask. I don't think we should take for granted that because someone like OpenAI said this is what they do, or talked about it, it must be what they're doing. These are questions we can approach scientifically, and the beautiful thing is that they're not that expensive to take on in a lot of ways. Especially the fine-tuning questions: those just involve fine-tuning a model, not training something from scratch. The data questions are a little trickier, but even at small scales we should be able to see effects that give us a sense of what may happen at large scales. For me this seems to be the key leverage point right now. The other questions that always get me excited are the questions of how these models really learn. This is a wildly complex process, where we're taking tons of data, tons of parameters, and tons of compute, throwing it together in a stew, and mixing it, and sometimes good things seem to come out. But what are the chemical processes actually happening there? How does learning take place? What does the network learn over time, how does it learn, and what are those dynamics? That always gets back to the application, the Mosaic question of how we speed it up, but the mere act of understanding how learning happens is itself a fascinating scientific question. For those looking for PhD programs, I don't think it's too late to apply to Harvard this fall, though you'd better really start writing and get your letters in, like, yesterday. If you want to come work with me on a PhD at some point, those are the questions I'm most excited about academically, because I find the fact that you can stir all this stuff together and get something like GPT-3 out to be endlessly mind-blowing.

Lukas: I totally agree. Totally different direction, I guess: when you look at taking these models and actually getting them running in the real
world, both inside of Mosaic and then at your customers, where you hand off the model, I'm curious where you see the unexpected bottlenecks. What are the actual hard parts about getting a big model doing something useful for someone in the real world?

Jonathan: It's always the stupid stuff. As anyone who's ever done software engineering knows, it's always the semicolon you forgot. It's never that your algorithm has the wrong time complexity; it's always that you mixed up two variable names, or accidentally overwrote a variable somewhere in your for loop and didn't realize it. We see that all the time, like the example I gave of using a different tokenizer for training than for evaluation and thinking that your model isn't training at all. It's those kinds of mistakes, and there are so many different places where they can happen, because these are such complex, nuanced systems. It's the dumb mistakes that kneecap you. Did you know, and I didn't know this, that on an A100 or a V100, if you change the memory format of a ResNet to channels-last, which doesn't change anything about the model, it's a one-liner, model dot to channels-last or something like that, you get something like a 30% speedup? I sure didn't know. PhD-student me really wishes he'd known that. It would have saved me a huge amount of time, and a lot of money for my advisor, and I would have done a lot more science. Who knew? It was written down on a beta PyTorch documentation page somewhere, and a few people in the know knew about it. We wrote it down and put it out very publicly, and we try to do that with everything, so people know where to find it. But it's the little things that always kill you, not the big stuff. At the end of the day we can figure out how to get the stuff running, but it's the little things that drag you down, death by a thousand cuts, a lot of the time. The other killer is always efficiency: when you're training at this scale, if it's not efficient, it's going to take the life of the universe. That's kind of the Mosaic problem, in some sense. The other is that there's a difference between getting it running and getting it running smoothly, reliably, and consistently, versus having something jury-rigged. I'll give you a quick anecdote here; whether this makes the cut or not, we'll see. Here's how I trained my models back in my PhD. Google was kind enough to give me some free TPU capacity, but there wasn't a job scheduler associated with it. You had to manually spin up a TPU, SSH in, start your job, SSH out, let it go, and check in, and the TPU sometimes froze up or died, or my code crashed. So I wrote a little orchestration platform on top of it. I made a Google spreadsheet and used the Google Sheets API, so that whenever I typed a line into the spreadsheet with an experiment, a little daemon process I had would read the experiment off, check to see if it was already running, and if not, assign a TPU, kick it off, run it, and update the spreadsheet with its status, change the color of the line and everything. I used that to collect, over the course of my PhD, almost a million checkpoints. I just deleted those checkpoints because they were taking up God knows how much space; it was over a million files deleted. This was the jankiest thing in the world. It worked, to my advisor's chagrin. It worked, but it was terrible, unscalable, and unsustainable.
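Backing up to the channels-last tip mentioned a moment ago, here is a minimal sketch of what that one-liner looks like with standard PyTorch APIs. The exact speedup depends on the GPU and on using mixed precision; the 30% figure is the ballpark quoted in the conversation, not a guarantee, and the batch size and model choice below are just illustrative.

import torch
import torchvision

model = torchvision.models.resnet50().cuda()
model = model.to(memory_format=torch.channels_last)       # NHWC layout for conv kernels

images = torch.randn(32, 3, 224, 224, device='cuda')
images = images.to(memory_format=torch.channels_last)     # inputs must match the model's layout

with torch.autocast('cuda', dtype=torch.float16):         # mixed precision lets tensor cores do the work
    out = model(images)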
Jonathan: If I'd had to do something even bigger, if I were running a business and this were what was holding things together, God help me. It was good enough for a PhD student, and a lot of us settle for that quality of solution; a lot of us say we're okay with that. In the Weights & Biases context, a lot of us say, ""I'm okay with TensorBoard,"" even though it requires something like 128 CPU cores and gigabytes or terabytes of memory, and it still crashes all the time and doesn't really do anything right. There's nicer stuff out there, in some sense; we know how to build better tools. But many of us have been in academia for so long that we feel like we don't deserve better: why should I have access to the nice tools, I'm an academic. My research team says this to me all the time. Our engineers will say, ""Why didn't you report this bug? Why did you just do this horrific thing to work around it?"" After we go through enough rounds, it typically comes down to, ""Well, I didn't want to disrupt you, and I didn't feel like I deserved to have this bug fixed."" We've accepted self-flagellation as part of being an academic or a researcher. But there are clean, good solutions out there that just work for a lot of these problems, and if there aren't, someone's building them right now. That janky solution you knitted together using Slurm: you shouldn't use that. That TensorBoard-based solution that crashes half the time, gets the lines mixed up, and has weird spikes and gaps in the data: there's something better out there than that, and we should be willing to use it. We also have this philosophy in computer science that I shouldn't have to pay for things. My poor friend was using LibreOffice when they could have just paid a relatively small amount of money to Microsoft or Google and gotten a product that genuinely worked. Or I had a professor I worked with who really wanted to use his Linux laptop even though it almost never connected to the Wi-Fi and barely ever printed. I see things like Slurm or TensorBoard as kind of the LibreOffice-type solution: it will sometimes get the job done, but it's not that expensive to get the real thing, especially if you're an academic, because there's academic pricing for all this stuff. And if you're a company, the cost of your engineers' time absolutely dwarfs the amount of money you'll pay to any one of these companies to get a product that works. So I do think there's something tied up in this: did I homebrew something crappy because I didn't know any better, because I didn't feel I deserved better, or because I have this trained instinct not to pay for things? We have an open-source ethos in computer science, but I see all of those things dragging people down a lot. To some extent the tools didn't exist when I needed them, but now there are fantastic tools out there, and if someone today had the same infrastructure I had, the same capabilities, the same hardware, the TPUs, they should not do it the way that I did it. In this day and age there are tools out there that will help
So we should be focusing on the problem. I've strayed way away from your question and rambled a lot, and we'll see how this gets cut up when all is said and done, but to make a very long story short: it's not as hard as it used to be, there really are tools that will help you avoid making mistakes, and you should just use them.
Well, there's a role here for you as a head of marketing, if you're interested.
It's great to talk, you know, as a dedicated user of your product. We're willing to pay to not have to suffer, and it's worth it for us.
Nice, thank you. We appreciate it. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a transcription that we work really hard to produce, so check it out,15423
+Scaling LLMs and Accelerating Adoption,https://www.youtube.com/watch?v=sD24pZh7pmQ,4072,2023-04-20,Aidan: These models have seen the internet, and so there might be test-set leakage. So I think the only reliable measure is actually putting them in front of people and asking them which model they prefer. You put two models in front of them, you let them interact with both, put prompts into both, and they just pick the one that they prefer. It's the most reliable measure.
Lukas: You're listening to Gradient Dissent, a show about machine learning in the real world, and I'm your host, Lukas Biewald. Aidan Gomez is the CEO and co-founder of Cohere and an author of the seminal paper "Attention Is All You Need", which introduced Transformers. I've been looking forward to talking to him for a long time, and he was super willing to answer my crazy questions, like in how many universes in the multiverse Transformers exist in the form they've taken. This is a really fun, thoughtful interview, and I hope you enjoy it.
Lukas: The place that I really wanted to start: you're one of the authors of the "Attention Is All You Need" paper, which has to be one of the most influential papers of all time. I think most people listening will have read it, but maybe you could describe the result, just to catch up the few that haven't heard of this paper.
Aidan: For sure. I appreciate you saying it's one of the most influential papers of all time. I don't know if I believe that, but it's definitely influential for the current time.
Lukas: Fair.
Aidan: I was part of the paper back when I was at Google Brain. I was just an intern there during my undergrad, and I landed under the management of Łukasz Kaiser. He and I built a software platform for training big neural nets called Tensor2Tensor, and during the internship, very early on, we connected with Noam, Jakob, Ashish, and Niki and convinced them to train the models they were exploring on Tensor2Tensor. What they were building ended up being the Transformer. It's an architecture whose purpose was really to come up with a much simpler, more refined, scalable architecture than LSTMs and RNNs, which were the category of model deployed for sequence-to-sequence learning before the Transformer. Some people view Transformers as extremely complex, but I think most of those people are putting them in comparison to LSTMs and what came before, because Transformers are really just an attention block stacked on top of a multi-layer perceptron, a feed-forward layer, and then a bunch of those stacked on top of each other. I think that's an incredibly simple platform to build off of, in comparison to what came before.
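As a rough illustration of the block structure described here (an attention block followed by a feed-forward layer, with several such blocks stacked), a minimal PyTorch-style sketch follows; the dimensions, pre-norm layout, and layer count are illustrative choices, not the paper's exact configuration.

import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention followed by a small feed-forward network."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # the attention block
        return x + self.ff(self.norm2(x))                  # the feed-forward block

# "a bunch of those stacked on top of each other"
encoder = nn.Sequential(*[TransformerBlock() for _ in range(6)])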
Aidan: Obviously, with the benefit of hindsight, the importance of scale is now obvious, but back then I don't think it was. There was the general rule of thumb that the bigger the neural network, the better, but I don't think people really grasped how far we in the field needed to push that. There were still fears of big models overfitting; overfitting was this big fear, and you didn't want to get too large because you'd start actually performing worse at the tasks you care about. But we took a bet that we could take this to a more extreme level and scale it up to not just single-digit accelerators, or tens of accelerators, but thousands of accelerators, and that bet paid off.
Lukas: What inspired you to try the attention mechanism without the underlying LSTM? I remember when attention first came out on these recurrent neural networks. In hindsight it seems like an obvious thing to try, but what inspired you to do that?
Aidan: My understanding of where the inspiration came from: attention had already been hugely successful with RNNs and with LSTMs, and it became this really crucial part of the stack in that old generation of sequence learning. But the notion of attention from a neuroscience perspective was really attractive, and it was viewed as something fundamental to intelligence, to human intelligence. Folks like Geoff Hinton really appreciated that, and the people that I was working with, like Noam and Jakob, saw that same elegance in the structure, that same fundamental importance of attention as a computational modeling structure or tool, and they wanted to put it front and center in a model and see what popped out the other side. So, taking away all the other complexity, all the other little tricks and hacks, they wanted attention to shine through as the fundamental unit of intelligence within this neural net architecture. That was the seed.
Lukas: "Attention" is such an evocative word. It's hard not to extrapolate to our own brains and things like that when you use the word attention, and yet the math of a Transformer is super simple, and sometimes I wonder, if the math had come from a different path, whether it might not be called attention. Do you think there's any truth to that? How fundamental is attention, fundamentally?
Aidan: The other way of describing it is as a soft lookup. You're doing some sort of soft lookup in a table. By "soft" I mean that a hard lookup would be looking at just one entry in a table, and a soft lookup is looking at a bunch of different entries with different weights associated. As far as I'm aware, attention was branded "attention" by Bahdanau, or whoever that was, but there were parallel efforts to integrate lookups into neural networks, so I think you're right that that description of attention is very salient.
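The "soft lookup" framing can be written down directly. A small illustrative sketch (PyTorch, toy shapes) contrasting a hard table lookup with the softmax-weighted lookup that scaled dot-product attention performs:

import torch
import torch.nn.functional as F

def hard_lookup(table, index):
    # database-style lookup: exactly one row comes back
    return table[index]

def soft_lookup(query, keys, values):
    # attention-style lookup: every row comes back, weighted by relevance
    scores = query @ keys.T / keys.shape[-1] ** 0.5  # similarity of the query to each key
    weights = F.softmax(scores, dim=-1)              # differentiable "how much of each row"
    return weights @ values

table = torch.randn(10, 4)
print(hard_lookup(table, 3))                       # one row
print(soft_lookup(torch.randn(4), table, table))   # a weighted blend of all rows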
Aidan: From an intuition perspective, from the perspective of making a computational concept intuitive, "attention" is just way better than "a soft lookup". We can understand attention; we can kind of grok it; we get what the model is doing underneath much more quickly than if you use language from databases or something like that. So from a branding perspective, I think "attention" took off because it maps onto our own understanding of these things much more closely. It gives us better analogies to work with the concepts.
Lukas: Okay, so another experience that I've had with neural networks is that the exact details of the architecture often don't seem to matter, and we end up picking these architectures based on trying to replicate previous papers and not wanting to mess something up. How fundamental do you think Transformers are? And maybe to make it a more well-formed question: if you ran history back a thousand times, how many of those times do you think you would get the Transformer architecture specifically?
Aidan: That's a really interesting thought experiment. I don't think it's that fundamental. I think this is less and less of a hot take, but I think all you really need to do is saturate compute in a good way. You need to come up with an architecture that scales, that is sufficiently simple that people can mess with the architecture itself and explore their own ideas within it, but sufficiently scalable that you can basically go to any parameter count you want in a nice way.
Lukas: So do you think we're the only universe in the multiverse where there's the exact Transformer calculation?
Aidan: I think it's probably countably infinite instantiations that do use the Transformer.
Lukas: But do you think it's more than measure zero?
Aidan: I think it's measure zero.
Lukas: You think it's measure zero? Fascinating.
Aidan: Yeah, I think there are loads and loads of combinations we could have landed on, and I just feel like it's really unlikely that the Transformer is optimal. I think there are still more exciting avenues to explore, in particular SSMs, state-space models. Those are emerging as really promising alternatives to the Transformer.
Lukas: I'm not familiar with that. How does it work?
Aidan: I'm not familiar with it either, but I do think it's potentially the next step. I'll give you a link to learn more about it; read the S4 paper.
Lukas: But the general principle... what makes you think they're promising if you don't understand them? I'm kidding, I'm being facetious.
Aidan: With SSMs, the idea is that we're trying to cut some middle ground between Transformers, which are fully autoregressive and attend over the entire past sequence, and, on the other end of the spectrum, LSTMs or RNNs, which have a state and have to memorize things in order to remember the past. SSMs try to find a middle ground where you have some window within which you can do lookup, but for everything outside of that you rely on an internal memory that you can read and write from. And you do that while also being extremely scalable, so you can parallelize it across thousands or tens of thousands of accelerators. So it's trying to strike that middle ground.
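The core object in a state-space model is a simple linear recurrence over a hidden state. A minimal NumPy sketch of that recurrence is below; it deliberately leaves out what makes S4 practical (the structured initialization of A and the machinery that computes the scan in parallel), so treat it as a cartoon of the idea rather than the method itself.

import numpy as np

def ssm_scan(A, B, C, inputs):
    # h[t] = A @ h[t-1] + B @ u[t];   y[t] = C @ h[t]
    # The hidden state h is the "internal memory" carried across the sequence.
    h = np.zeros(A.shape[0])
    outputs = []
    for u_t in inputs:
        h = A @ h + B @ np.atleast_1d(u_t)
        outputs.append(C @ h)
    return np.array(outputs)

state_dim = 16
A = np.eye(state_dim) * 0.9                 # toy dynamics; S4 structures this matrix carefully
B = np.random.randn(state_dim, 1) * 0.1
C = np.random.randn(1, state_dim)
y = ssm_scan(A, B, C, np.sin(np.linspace(0, 6, 100)))  # a 100-step scalar input sequence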
Aidan: I think its success is going to be predicated on whether the community builds tooling. Obviously the folks at Hugging Face with the Transformers library, and many others, have built incredible software tooling for Transformers. They make it trivial to scale from 10 million parameters up to a trillion parameters nowadays, and making that trivial is tons and tons of work at the software level. For SSMs, for state-space models, that just doesn't exist today. None of it exists; there's no mature software platform for scaling these things. So I could see a world where Transformers get replaced: the software support for SSMs gets more mature, and our next generation of models loses that context-window constraint that Transformers give us, where it's "you have this many tokens, and anything outside of that, sorry, I have no idea about it, I've never seen it." Instead you get something that in theory could have an infinite context window and just continuously read and write from this big buffer of memory.
Lukas: Although the latest GPT-4 release had quite a large context window. I was kind of surprised to see how big that context window had gotten. What did they do to make that possible?
Aidan: There are nice techniques, like ALiBi and others, that let you quite naturally extend it. So even if they trained on a context window of, let's say, 8K tokens, they're able to serve one that has access to 32,000 tokens, because they have methods that at inference time can quite naturally, quite easily go beyond that limit. That was an important breakthrough, not for completely alleviating the limitation of Transformer context windows, but for giving you a workaround. I still think we need to push past that, and I still think memory is going to be an important piece of it, but it definitely makes the issue far less painful.
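For context on the ALiBi mention: the idea (from the "Train Short, Test Long" paper) is to skip learned position embeddings and instead add a per-head linear penalty on query-key distance to the attention logits, which is what lets a model extrapolate past the context length it was trained on. A rough sketch, with illustrative slope choices:

import torch

def alibi_bias(n_queries, n_keys, n_heads):
    # Per-head slopes form a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(n_heads)])
    # Distance from each query position i back to each key position j.
    dist = torch.arange(n_queries)[:, None] - torch.arange(n_keys)[None, :]
    dist = dist.clamp(min=0)               # future positions are handled by the causal mask anyway
    return -slopes[:, None, None] * dist   # shape (n_heads, n_queries, n_keys)

# Added to the attention logits before the softmax, e.g.:
# scores = q @ k.transpose(-2, -1) / d ** 0.5 + alibi_bias(T, T, n_heads)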
Lukas: Can you say more about what it means to saturate compute? From my perspective, and correct me if I'm wrong, the thing that Transformers have done phenomenally well is scale as compute gets bigger and bigger, seemingly like almost nothing else. At every level of compute the performance seems to continue to improve, and I don't think that was true of other architectures that folks used in the past. But if Transformers are measure zero in the multiverse, what are other ways that compute could be saturated?
Aidan: I think a very high-measure set of architectures are the ones that saturate compute; that's kind of a necessary component of success. What does it take for an architecture to saturate compute? It requires tons and tons of matmuls and very few unnecessary ops. You really want your entire architecture to just look like big matmuls.
Lukas: A very classical, fully connected neural network would satisfy that condition, right?
Aidan: That would be ideal.
Lukas: Why wouldn't a multi-layer perceptron work, then? It seems like it would satisfy your condition as laid out.
Aidan: I think Transformers are almost MLPs. They're a very short skip away from MLPs, so they do look like just a bunch of matmuls; they add in one more axis to do matmuls across, and that's the length of your sequence. But it's basically trying to cut as close to a massive MLP as you possibly can, because those saturate compute the best. What you want to avoid are tons of little ops: softmaxes, little activation functions, dropout layers, all of these little things that break up those big matmuls. And you certainly don't want to introduce operations that break your horizontal scaling of layers, your ability to split a layer across tons of accelerators, your width-wise parallelism. Anything that requires communication across those chunks within a layer slows things down, because now you need to start communicating between those parallel lines of compute. So you want to minimize the amount of intra-layer communication you have to do and maximize your ability to parallelize, let it run, and then come back with the result.
Lukas: That's really just a computational optimization, right? There's nothing sort of intrinsic about that?
Aidan: Intrinsic in what sense?
Lukas: I guess, do you think almost any architecture trained for long enough with enough parameters would have this property that it would improve over time, and really what you're looking for is just something where you can do backpropagation, distributed, quickly?
Aidan: I do believe that there are a lot of possible architectures that would be fast, efficient, and result in the performance we're seeing from the current large language models. There are some things you can't do. You can't literally just scale up an MLP with ReLUs, because that would be done pointwise; it would just be a bag-of-words model, and you wouldn't be able to learn relationships between words. You do need sequence structure; you need to train sequence models. But so long as you're not breaking that, or severely compromising it, I think there's a huge swath of models that would perform equivalently well and scale equivalently well. We've mostly just had this joint optimization between our software, like the Transformer implementations and the frameworks that train them, and even our hardware, like the routines that we support within CUDA and within our accelerators, feeding back on each other for a while now. So at the moment there might be a local minimum where it's hard to break off of Transformers, because there's just so much software and hardware support for them and they're so heavily optimized. As soon as you step off that rail, you're sacrificing some performance: you drop back to a more general implementation of a kernel, which runs 20% slower, which for a big model costs you, I don't know, a million dollars or something. So there are some interesting effects there, with that feedback between hardware and software kind of locking us in to one architecture.
Lukas: So do you think there's no architecture that's sufficiently better than Transformers that there would be an impetus to change?
Aidan: Do I believe that? I think stuff like the sequence-length constraint could be problematic enough that it actually might push us off of Transformers. I would not be surprised if that were the case. Scaling quadratically in the sequence length is a huge problem, one that you only run into once you start doing more and more interesting stuff, but I think we're heading into that territory, especially with multi-modality.
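A back-of-envelope illustration of the "mostly matmuls" point above: counting floating-point operations for one Transformer layer at illustrative sizes shows the big matrix multiplies dwarfing the little elementwise ops (softmax, norms, activations) by orders of magnitude.

def layer_flops(d_model=4096, d_ff=16384, seq=2048):
    proj = 4 * 2 * seq * d_model * d_model   # Q, K, V and output projections
    mlp = 2 * 2 * seq * d_model * d_ff       # feed-forward up- and down-projections
    attn = 2 * 2 * seq * seq * d_model       # QK^T scores and the attention-weighted sum
    little = 10 * seq * d_model              # rough budget for softmax, norms, activations
    matmul = proj + mlp + attn
    print(f"matmul share of FLOPs: {matmul / (matmul + little):.5%}")

layer_flops()  # well over 99.9% of the arithmetic sits in a handful of big matmuls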
Lukas: I would think, then, that you would predict that model performance will continue to scale consistently as compute is added, basically forever. Would you agree with that?
Aidan: Yeah, definitely. I think it is quite predictable: if you fix the dataset, fix the architecture, and choose your levers of scaling, it is going to give you quite a predictable scaling pattern, or scaling law.
Lukas: And how do you think about data constraints as the model complexity grows? Is data going to be the next constraint for building very large LLMs?
Aidan: It already really is. I think of the tech stack for large language models as a literal stack, where each piece builds off the one before. We started with base models, like GPT-3, where you just train on, I don't know, a trillion tokens scraped from the web. It's super noisy, but you learn a little bit about a lot of different stuff; you consume a ton of knowledge from all over the web. That model is not useful on its own. It's very cool that it has some few-shot prompting behavior and that type of thing, but it's really hard to make it useful for you. So what you need to do is then train what we at Cohere call command models; OpenAI calls them instruct models. You train these instruction-tuned models from human feedback, and that changes the model into something that is way more useful. It changes the UI onto the model: now you can just give it a natural-language instruction, and it leverages all the stuff that it learned from the first training phase on the raw internet, but it's a much more pleasant user experience. Those are far, far more useful. The next step above that is dialogue models. Again, you go out, you collect some human data of dialogue, of conversation, and you fine-tune that command model, that instruction-following model, to do dialogue. Again, that unlocks another huge improvement in user experience. People just love conversation; that's what you and I are doing right now. It's the most natural thing for humans to do, to chat with each other, to have a conversation. So when you flip into that modality, into conversation, it's driven by data, and when you flip into command following, instruction following, it's driven by human-annotated data. This tech stack that we're working along, the vertical momentum is ostensibly entirely driven by data. There are almost no modeling changes.
Lukas: Sorry, momentum?
Aidan: Vertical momentum up this tech stack. I'm thinking base model, command model, dialogue model, et cetera. The momentum up that stack, the thing that takes you from one layer to the next, is all data.
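To make the "it's all data" point concrete, the artifact that moves a base model up to a command model is essentially a curated set of instruction-response pairs. A heavily simplified, hypothetical example of what such a dataset might look like (the field names and file are illustrative, not Cohere's or OpenAI's actual pipeline):

import json

examples = [
    {"prompt": "Summarize the following support ticket: <ticket text>",
     "completion": "<two-sentence summary>"},
    {"prompt": "Extract every date mentioned in this email: <email text>",
     "completion": "<list of dates>"},
    {"prompt": "Write a polite reply declining the meeting invitation.",
     "completion": "<short reply>"},
]

with open("command_tuning.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# A base model fine-tuned on many such pairs learns the general pattern:
# "the user states a task, and my job is to respond with the answer."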
Lukas: And what is a model like before it's had this kind of specific human feedback to make it respond to commands? What does it feel like to interact with that?
Aidan: It feels a little bit, what's a good comparison, a little bit schizophrenic. It's very all over the place. It's hard to get it to do what you want; you're trying to give it instructions and it might break in some very unpredictable way. You might change one word and suddenly it gets what you're saying and is capable of acting on it. It's a very inconsistent partner to build with and to create systems with. It knows tons of stuff and has tons of ideas, and it can vaguely pull from its vast breadth of knowledge, but it's very disorganized. It's very hard to pull coherent behavior out of it.
Lukas: And that's what's happening in that fine-tuning phase with the human feedback? What exactly are you doing?
Aidan: You're collecting examples of the behavior that you want the model to possess. In the command phase, which is just one layer above the base model, you're collecting pairs of examples: I give the model an instruction, and the model responds following that instruction. You do that a bunch of times, for a ton of different tasks, and the model picks up the general idea: okay, the user is going to tell me what they want me to do with some data, and my job is to just respond with the answer. It starts to behave like that, and now it's not super finicky about the placement of words or the word choice; it's kind of indifferent to that. That UI is just way more usable, way more useful.
Lukas: And how do you measure whether that's working, or how do you know when you're done with that part?
Aidan: Measurement of performance, evaluation of these models, is incredibly hard, for a bunch of different reasons. One of them is that a lot of the academic datasets we have are super noisy and low-signal. They might actually be negative signal: if you're doing really well on a particular academic dataset, that might be cause for concern, which is worrying. And the other piece is that these models have seen the internet, and so there might be test-set leakage. So I think the only reliable measure is actually putting models in front of people and asking them which one they prefer. You put two models in front of them, you let them interact with both, put prompts into both, and they just pick the one that they prefer. That's the most reliable measure. But for what I was just describing, going from the base model up to the command model, you can just measure that in academic dataset performance. If you throw a base model at these academic datasets, and you throw a command model, which is really just a derivative of the base model, the command model is going to perform leaps and bounds better, dramatically better.
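The pairwise human evaluation Aidan describes reduces to a simple win-rate measurement. A sketch of that loop, where model_a, model_b, and ask_human are placeholder callables rather than any real API:

import random

def preference_eval(model_a, model_b, prompts, ask_human):
    """Show both completions for each prompt (in shuffled order) and count how often A wins."""
    wins_a = 0
    for prompt in prompts:
        a, b = model_a(prompt), model_b(prompt)
        a_first = random.random() < 0.5               # hide which model produced which answer
        pair = (a, b) if a_first else (b, a)
        choice = ask_human(prompt, pair[0], pair[1])  # 0 or 1, the preferred completion
        if (choice == 0) == a_first:
            wins_a += 1
    return wins_a / len(prompts)                      # win rate of model A over model B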
Lukas: And is this step comparable in cost to training the base model? Practically, how long does it take, how expensive is it?
Aidan: It's really hard to get right. Of course, it's all data collection. The scale of the data is way less, and the duration of training is way shorter than the initial base-model phase; we're certainly not talking about trillions of tokens in this curated-data regime. But the specific tokens that you choose are extremely important. They need to be noiseless; they need to be extremely carefully combed over to make sure they're clean, that there are no noisy samples in there. That's a really difficult, expensive process, and it involves a lot of humans looking over a lot of data. In terms of comparison of the cost, it's definitely less than the base model; I don't know the relative ratio, but I would say it's all expensive. It's certainly expensive to get humans to generate data for you; that is extremely costly. And the more valuable the data, the more expert or challenging it is, the more expensive it is to get. You can imagine, if you want your model to be as good as a lawyer at answering legal questions, the data that you need to collect there is going to come from lawyers who cost a thousand dollars an hour, so the cost can be extreme.
Lukas: When I look at the performance of large language models, it's hard not to infer this exponential curve and expect that in a year they're going to be even more amazing than they are today. But if your view is that we sort of hit this wall with data for the base model, is that wrong? Are we on the verge of needing a totally new approach?
Aidan: I think we are approaching the need for another scaling breakthrough. I don't know how close we are to it, but it definitely feels like it's coming up, and we're starting to hit the limits. These models are increasingly operating at average human performance or above on such a wide selection of tasks that you can no longer rely on the average human as a source of data to improve your model. Obviously, once your model performs as well as an average human, it's no longer valuable to collect data from average humans, because they don't add to the model's performance. You need to start going to exceptional humans, to experts in particular fields or outliers in terms of performance at particular things, and eventually you run into the bottleneck of humanity's performance on a particular task. Then you can't go to humans to offer data that outperforms your model, because your model is already performing as well as the best human at that task, and so you need to search for a way for the model to self-improve, to work with itself and test itself, to start getting effects like you saw with AlphaZero. It's obvious how to do that in a game-playing setting: two copies of the model play against each other, and they self-improve because each one gets a little bit better and the other one has to beat it. It's much more difficult to think about how you would do that with a large language model, but at some point, once these models start to reach top-tier human performance, we're going to have to find ways to enable that sort of interaction and self-improvement.
Lukas: Interesting. It does appear that these models can do lots of things that the average person can't do, or that I can't do. One domain where it's much easier to check than to generate is code generation. I'm amazed at the quality of, or maybe the rapid progress in, code generation, and it seems like it could be easy to at least know whether you got the right answer for a code generation problem. Does that seem more tractable than other domains, then?
Aidan: Writing software is nice in that you can run it, and if it doesn't compile, that's a huge issue, and if it runs but outputs the wrong answer, there's a very clear signal for success there. It's much softer in other areas, so I definitely think code has nice properties for those reasons; evaluation might be easier. Although verification of code is also a super hard problem, and there can be very subtle bugs that your tests don't capture, that the model might introduce, that you'll then miss, and then you'll deploy it and it could have catastrophic consequences.
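The "you can run it" signal for generated code is easy to sketch: write the model's output and a test file to disk, run the tests, and treat a passing suite as (imperfect) evidence of correctness. This assumes pytest is available and is an illustration of the evaluation signal, not of any particular vendor's pipeline.

import os
import subprocess
import tempfile

def passes_tests(generated_code: str, test_code: str) -> bool:
    """Run model-generated code against a unit-test file and report whether the suite passes."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "solution.py"), "w") as f:
            f.write(generated_code)
        with open(os.path.join(d, "test_solution.py"), "w") as f:
            f.write(test_code)
        result = subprocess.run(["python", "-m", "pytest", "-q", d],
                                capture_output=True, timeout=60)
        return result.returncode == 0  # passing tests is a clear, but not airtight, success signal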
Aidan: So in some sense it's nice, because it feels like a much harder, more objective "yes, the model did well" or "no, it didn't", although there's nuance to that whole topic as well. But I also think it's a limited setting. There's a lot you can do in code and in software development, but if we're trying to completely transform human productivity, or value creation, or whatever, we have to step outside the bounds of code, and we need to be able to automate work in spaces that have nothing to do with code. So I think it's a great platform for experimenting and developing with these models, but we have a lot to do outside of code as well.
Lukas: That makes sense. I'd love to ask you about your company, Cohere. You're also the CEO of a company that's building these models and making them available. How would you describe Cohere's positioning in the landscape of companies building and selling these large models?
Aidan: Cohere got started, I think, three and a half years ago, so that was before GPT-3, but I believe after GPT-2, and even back then the founding mission was really just to get this tech deployed in as many hands and as many products as we possibly could. My co-founders, Nick and Ivan, and I were part of the folks who got the earliest access to these sorts of generative models, and I personally felt like it had pulled my timelines for interesting, compelling AI way, way forward, like decades. It felt like we were getting a glimpse of the future, and that there was a very important technological frontier that was about to get crossed, and we weren't seeing things lining up to make these models accessible to people. They were behind barriers; they were inside of large behemoths; they weren't being put into the hands of developers, and enterprises weren't being supported in overcoming the barriers to adoption. So with Cohere, our product, our company, the whole point is really to undo those barriers: put it in more hands, make the APIs easy to use and accessible to folks who aren't machine learning experts, and solve all the barriers that enterprises have to adoption. I think the reason why it's taken this long to see the surface area of products and technology change with LLMs is mundane stuff like data privacy and people trusting these models, and us building awareness so that there's actual consumer demand for "no, the way that I want to interact with your product is through dialogue." We're now at a point where it feels like the market is seeing that, and enterprises are starting to see that the consumer is going to choose the product that lets them interact with it in the best possible way, in the same way that when I was graduating from high school I chose my bank because it had a mobile app, and that's how I wanted to interact with my bank. I didn't want to use my browser or walk into an actual physical location; I just wanted it on my phone. In the future, I think the next generations graduating from high school are going to choose their bank based on the one where they don't have to call in to debug something; they can just talk to it, just chat with it, at 3 A.M. Those consumer product decisions are going to be based on the interfaces they're able to use products through, and Cohere wants to help enable that, to help accelerate organizations in adopting it, because it's going to become a competitive necessity.
Aidan: So yeah, the vision is really just to support everyone in adopting this tech, and to help accelerate that adoption.
Lukas: So do you offer a way to use these models without the data leaving your own infrastructure?
Aidan: Yeah, totally. We're cloud-agnostic; we're not locked into any one cloud. And the data privacy piece is a super important one, especially if you're working with customer data or internal documents and secrets. We can deploy within your VPC, with no visibility on our side into your data. Data privacy is a core feature for us.
Lukas: And it seems like you offer specific use-case-based APIs, but one thing I've been struck by, with GPT-3 being publicly available, is just the plethora of use cases that seem possible, or newly possible. How do you think about that? Is there a way to use your models more flexibly, or do you plan to release tons of APIs for every possible way someone might want to use a model like this?
Aidan: We have the general models, which are our command models, and you can use them for anything within the terms of service. Super general: whatever you want to do, extraction, summarization, just chatting to it, whatever you want to use it for, it'll support it. And if it doesn't support it, let us know, and we have the ability to improve it next week when we launch a new version. So I think the general-purpose nature of the technology is a huge piece of the value prop: you can go to one model and get it to do tons of different tasks for you. At the same time, there is value to specialized endpoints, which we also build, stuff like the summarize endpoint, the classify endpoint, the search endpoint. Those specialized ones are much more targeted, and there are only going to be a few of them, because we only focus on the most popular use cases, like summarization, search, and classification, but those will be very tailored to that use case. So there's both: the highly general command-style model, and then specific endpoints targeted at one use case.
Lukas: How do you think about open source in general, and about opening up these models more?
Aidan: I love open source. I come from research, which is inherently, or is supposed to be inherently, open. At the same time, I am trying to build a business, something that's sustainable, an economic engine that lets us continue to innovate and continue to compete, and giving away your IP for free tends to be a bad business model; you disintermediate yourself. So we've been very hesitant to do that, principally because we want to build something healthy and sustainable. We want to be doing this for the rest of our lives, and to do that we need people to pay us for the work that we do. But I'm super supportive of more open models, and of better-performing open-source models. There will always be that category of developer who doesn't want to use an API; they want to get right down to the parameters and mess around with them, compress the model onto their laptop, et cetera. And there are loads of groups out there doing that work and building that foundation.
Aidan: I'm thinking of EleutherAI, Carper, Stability; there are loads of these folks who are doing that, and I'm super happy to see them out there, and I really appreciate the work that they do.
Lukas: I think I saw a result from Stanford recently, Alpaca, where they hit the API of an LLM quite a bit, to the point where they were able to kind of reconstruct the LLM for a really small amount of money. Does that seem right to you? Does that approach worry you, that your customers might rebuild your models?
Aidan: I think it doesn't seem right. I think the result may have been a little bit exaggerated, or maybe misunderstood by some folks. The performance of that model is super impressive, and it's ostensibly a distillation of one of these large models into a smaller model; when you get down to it, that's principally what's happening. Maybe the interesting result is the extent to which you can recover interesting behavior in a small model, but it still lags behind a larger model, and its utility is much narrower than that large model's. It might be good at the few things that it was trained to do, and it might evaluate well on a specific subset of tasks, but you lose a lot of what makes those big models magic; you lose a lot of that generality. If you pick 15 or 30 tasks, you can train an extremely small model to perform as well as a big model. As soon as you narrow in on a small set of abilities, you can get quite good performance. I think that's an exciting truth, because as soon as you know your use case, the thing you want to do, you can go from using these big general models to scaling down massively and getting a much cheaper version of that system. So I think it's an interesting result, but I don't think it's fair to say that the Stanford result is the same as the large model. Those two are not the same. The one that can run on your cell phone is impressive within a limited domain, but it's lost something, some generality, some intelligence, relative to the large one it was distilled from.
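The Alpaca-style recipe being discussed is, in outline, just distillation through an API: collect a teacher model's responses to a set of instructions, then fine-tune a small model on the resulting pairs. A bare-bones sketch in which teacher_generate, seed_instructions, and train_small_model are all stand-ins, not any specific vendor's API:

def distill_via_api(teacher_generate, seed_instructions, train_small_model):
    # Build an instruction-response dataset by querying the large "teacher" model.
    dataset = [{"prompt": instruction, "completion": teacher_generate(instruction)}
               for instruction in seed_instructions]
    # Fine-tune a much smaller "student" model on those pairs; it tends to track the
    # teacher on these narrow tasks while losing the teacher's generality.
    return train_small_model(dataset)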
Lukas: So it sounds like your experience of working on these large language models pulled your beliefs about when they get interesting, whatever that means, forward by decades. It has for me too, to be honest, and it makes me feel like AGI, which isn't clearly defined but has aspects that seem incredibly important for the world, really could happen in our lifetime when you extrapolate the curve. I have to ask: is that top of mind for you in your work? Does that seem like something that's coming?
Aidan: I don't think so. I don't spend an outsized amount of my time thinking about AGI. I do spend an outsized amount of my time thinking about how to make models more useful, and I think that's along the critical path towards AGI. We're going to build a lot of useful stuff that's going to make people way more efficient, way more productive. That lofty goal of AGI is exciting, and it's super salient; it's very easy to get sucked into it, and it certainly makes your work feel monumental in terms of importance. But I don't think you need AGI for this technology to be extremely impactful, and I think there are going to be a lot of other issues along the way, deploying these models, that don't touch on the issues that AGI discourse puts front and center. So I'm on the side that AGI discourse matters and we should be investing time and thought into it, but it's completely consumed the conversation around AI in general, to the point where it's distracting. And a lot of the issues that the AGI community touts as the largest issues, of pressing importance, worthy of regulation and slowing down and so on, a lot of those issues I think are, frankly, overstated.
Lukas: What do you think, then, are the most important, pressing issues?
Aidan: I think stuff like how these models can change public discourse is really concerning: the consequences of synthetic media at a scale that we've just never encountered, and the pressures that will put on society. Those issues are much more near-term and plausibly realizable with the technology of today, or maybe the next couple of years, and so that's really where public attention should be, and I see very light discussion about that. There's a lot of "what happens if the paperclip maximizer turns us all into iron to generate more paperclips." Obviously that's an exaggeration and an unfair characterization, but a lot of the discourse has that flavor, and it's very far-future and disconnected from the present and near-term reality of the technology. I would be really excited to see us, as a public, putting pressure on social media to implement verification of who's posting. How do I know whether this is a bot or not? I really want that filter. I want to be able to filter for "I'm hearing from humans, reading the opinions of humans," not a language model. But that doesn't seem to be happening. That conversation seems to be quite niche, restricted to very small communities. I think the broader public needs to start demanding it as a feature, because I think there will be a flood of synthetic content.
Lukas: It's funny, though. I would have expected a flood of synthetic content already at this point. The quality of synthetic content seems very high to me, there are lots of ways to do it, and it's pretty cheap.
Aidan: I think it's an awareness thing. I think it might already be happening, and it's such compelling text that it's hard to pick up. It might be difficult for you and me to appreciate, when we click on a popular tweet and read through some of the replies, the extent to which those are machine-generated, because you just intuitively believe you're staring at a bunch of humans giving their thoughts and opinions on this tweet. You just trust that; that's just what social media is. And similarly with your emails, the flood of emails that you're getting. You read an email written by someone you think is trying to market to you, and they're speaking very eloquently and fluently, and your spam catcher didn't flag it because it's very compelling and it seems targeted specifically to you. It's been observing all the emails coming into Gmail's servers, and no, this one isn't just a template of another one, it's written specifically to you, so it must be human. But it's not. So I think it's very easy for this to slip past, because of exactly the reason you're describing.
Aidan: These models are extremely fluent, they write very coherently, and yet they aren't people. So if they're pressing some idea in response to some political tweet, we don't want that; we don't want synthetic amplification of some position.
Lukas: Do you have any other things that you think are going to change in the next year or two, based on capabilities that already exist in these models? One example from my perspective: I could imagine a lot more chat-based interfaces into products. I think that actually is a nice interface when it works, and I'm starting to see these incredibly evocative demos of using things through just a chat interface. From your perspective, do you think our modes of interacting with computers are likely to significantly change?
Aidan: Definitely. It's important to remember that at the end of November, when ChatGPT came out, that was, for most people who interacted with that product, the first time they had had a compelling conversation with silicon. Every other moment in their life, they had only ever experienced that with people. For those of us who are in the field building these models, it can be like the frog in the pot, where nothing is ever surprising and it's all one small step from the step behind, but for most people that was the first time they had a conversation with a computer, the first time a human talked to a piece of silicon. So I think it's important to remember how massive a leap that is, and also to think about what it unlocks. I think it's going to become much, much more common that the default way you interact with a product or a piece of technology is through conversation. Instead of having to go through ten layers of menus to find the thing that you want to do, you're just going to have a conversation with that agent, and it has the ability to effect the change you're asking it to make. It's just so much more convenient to talk to a product than it is to learn a complex GUI and onboard that way. So I think this unlock of dialogue as an interface onto the space of products is a total transformation in how we interact with the stuff we build, the systems we build.
Lukas: So, you haven't raised the massive funding rounds...
Aidan: We've raised a fair amount of money, but not at the scale of OpenAI or Anthropic.
Lukas: Does what they're doing make you nervous? Are you consciously trying not to enter an arms race of building the most compute-intensive model ever, or do you plan to enter that realm? What are your thoughts and plans there?
Aidan: So, we have raised a lot of money, just not on the level of 10 billion, and we also haven't gone to individuals or patrons to raise money. I haven't made friends with a bunch of billionaires and gotten them to write checks to Cohere. Our prerogative has always been to build the company the right way, and to raise money at healthy milestones, when we've proved value creation, when we can convince an institutional investor that we hit these milestones and we need this much money to hit the next milestone. So we don't do those flashy big rounds
from one strategic investor, or one patron, or one benefactor, principally because we're building a company in a different way, and I think there's more independence afforded to you by doing that. Being completely beholden to one entity, or a small pool of entities, can lead to problems. Yes, it unlocks massive capital, but I think Cohere is a proof point that you actually don't need 10 billion dollars to build an extremely compelling, very smart model. I think we're a proof point that you don't need that if you're scrappy and capital-efficient and you have a super motivated, talented team. But we also don't want to take those shortcuts. We don't want to sell out and basically give half of our company to one of the tech behemoths and become a subsidiary. We want to stay independent, and we want to build a new company in a healthy way, I guess the normal way. It's weird that we're an outlier because we raise capital like a normal, good, healthy business, but I guess we are.
Lukas: In a world where there are people doing that, though... I even think of Weights & Biases: we might run leaner and slower, but we react to a world where the space is growing fast, and we certainly have well-funded competitors, and we want to make sure we keep up. I guess it feels to me like the space Weights & Biases is in is probably winner-take-all, or winner-take-most. Do you not believe that's the case for the space that you're in? Do you imagine a world where there are lots of different foundation models that do different things? Why would the world go in that direction?
Aidan: I sure hope that it's not a monopoly, and I definitely don't believe that it can be. Our competition, I think they're great, super talented teams, and they build great models, but they have their own flavor of doing it, their own way of doing it, their own concerns that they're optimizing for. Our flavor is different, and that's healthy. There should be a bunch of folks with a bunch of different takes building these models and putting them out there, and then the market should decide which one they want to adopt: who they feel is the best partner, who they trust the most, who they think is going to help them succeed. For Cohere, we want to win that on our merits. And I think for the end state of this effort, for foundation model companies like ourselves, it's very unlikely that it's winner-takes-all. We're all within a few months of each other, so it feels very unlikely, and winner-takes-all is just bad for the market; it's a bad setup for the people who need to consume these models. So I'm super optimistic that we'll have diversity, that we'll have a handful of folks building and deploying these models.
Lukas: Do you think that customers will want to fine-tune these large models going forward, or will that go away?
Aidan: There are specific cases where the answer is yes, but by and large I think fine-tuning is a mature-system feature. You want to fine-tune a model after you've exhausted all the other things you can do to boost its performance. It's really only once a system has been deployed and optimized, optimized, optimized that you eventually land at the fact that, okay, the only way for me to squeeze more performance out of this is fine-tuning. We're probably still too early in the tech adoption curve to actually see strong demand for that. I think eventually it does arise. In the interim, the focus for Cohere is just to make the models as adaptable as possible via prompting, or grounding, or these other methods, as adaptable as possible without having to customize weights, so you give people other levers to pull on to squeeze better performance out of the model.
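One of the weight-free adaptation levers mentioned here, grounding, amounts to retrieving relevant documents and placing them in the prompt. A schematic sketch in which retrieve and generate are placeholder callables, not a specific provider's endpoints:

def grounded_answer(question, retrieve, generate):
    """Adapt a general model without touching its weights: put retrieved context in the prompt."""
    docs = retrieve(question, top_k=3)                      # e.g. a vector-search index
    context = "\n\n".join(doc["text"] for doc in docs)
    prompt = ("Answer the question using only the context below.\n\n"
              f"Context:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return generate(prompt)                                 # any instruction-following model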
Lukas: We're almost out of time, but before I let you go, can you tell me one thing that's been surprisingly hard? So many things seem hard, but maybe something that people wouldn't necessarily know is hard about building these large models, and building products around them that customers actually use for mission-critical stuff.
Aidan: Interesting. What's something surprising and hard? Because I feel like a lot of people know how hard it is to train the model itself; it's obviously thousands of accelerators, and trying to keep that thing up is really, really difficult.
Lukas: It sounds stressful, too. You're spending so much money.
Aidan: Yeah. In the GPT-4 post there was a mention of the model babysitters. That's a real thing: we have people who just sit and watch models train so that we can bring them back up when they inevitably fail. Okay, something surprising about this whole thing. I'll tell you what surprised me: the importance of, and the sensitivity to, data was a real shock for me this year, as we really started scaling up our data collection efforts from humans, as opposed to from the web. With the web, when we were collecting data there, the model is super robust to noise; you can kind of get away with some weak heuristics, and you really just want to throw as much as you can in there. As soon as we went into the human data collection phase, one example or two examples, you only need to mess up a couple of times and suddenly you've sent the model down a direction that you really don't want it to go. It's extremely sensitive. If you teach it something that's wrong, it will lock that in and just forever believe that wrong thing. And that is surprising.
Lukas: That's very surprising. I'm quite surprised by that.
Aidan: Yeah, the sensitivity there. I was just not prepared for how delicate these models are.
Lukas: Okay, actual final question: is there some other topic around machine learning that you're interested in, that if you had some extra time you would look into, that you wish you had more time to explore?
Aidan: It would probably be robotics and embodiment. I just think that's so cool, and there is such strong consumer demand for it. We all know what the imagined version of that success would be, an extremely intelligent brain plus an extremely capable body. We know what that would be like and how transformational it would be to have access to that in our lives, and it feels like we're really far off. So I would love to effect change there and help build that. Robotics is super sick.
Lukas: Totally agree. Love it. Thank you, that was really fun.
Aidan: Yeah, thank you so much.
Lukas: That was great. If you're enjoying these interviews and you want to learn more, please click on the link to the show notes in the description, where you can find links to all the papers that are mentioned, supplemental material, and a
transcription that we work really hard to produce, so check it out,9869