# Lecture 8: ML Teams and Project Management

<div align="center">
<iframe width="720" height="405" src="https://www.youtube-nocookie.com/embed/a54xH6nT4Sw?list=PL1T8fO7ArWleMMI8KPJ_5D5XSlovTW_Ur" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>

Lecture by [Josh Tobin](https://twitter.com/josh_tobin_).
Notes by [James Le](https://twitter.com/le_james94) and [Vishnu Rachakonda](https://www.linkedin.com/in/vrachakonda/).<br />
Published September 26, 2022.
[Download slides](https://fsdl.me/2022-lecture-08-slides).

## 0 - Why is this hard?

Building any product is hard:

- You have to hire great people.
- You have to manage and develop those people.
- You have to manage your team's output and make sure your vectors are aligned.
- You have to make good long-term technical choices and manage technical debt.
- You have to manage expectations from leadership.
- You have to define and communicate requirements with stakeholders.

Machine Learning (ML) adds complexity to that process:

- ML talent is expensive and scarce.
- ML teams have a diverse set of roles.
- Projects have unclear timelines and high uncertainty.
- The field is moving fast, and ML is the "[high-interest credit card of technical debt](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf)."
- Leadership often doesn't understand ML.
- ML products fail in ways that are hard for laypeople to understand.

In this lecture, we'll talk about:

1. ML-related **roles** and their required skills.
2. How to **hire** ML engineers (and how to get hired).
3. How ML teams are **organized** and fit into the broader organization.
4. How to **manage** an ML team and ML products.
5. **Design** considerations for ML products.

## 1 - Roles

### Common Roles

Let's look at the most common ML roles and the skills they require:

- The **ML Product Manager** works with the ML team, other business functions, the end users, and the data owners. This person writes design docs, creates wireframes, and develops a plan to prioritize and execute ML projects.
- The **MLOps/ML Platform Engineer** builds the infrastructure that makes models easier and more scalable to deploy. This person handles the ML infrastructure that runs the deployed ML product, using platforms like AWS, GCP, Kafka, and other ML tooling vendors.
- The **ML Engineer** trains and deploys prediction models. This person uses tools like TensorFlow and Docker to work with prediction systems running on real data in production.
- The **ML Researcher** trains prediction models, often those that are forward-looking or not production-critical. This person uses libraries like TensorFlow and PyTorch in notebook environments to build models and write reports describing their experiments.
- The **Data Scientist** is a blanket term that can describe any of the roles above. In some organizations, this role entails answering business questions via analytics. This person can work with wide-ranging tools, from SQL and Excel to Pandas and Scikit-Learn.

 | |
### Skills Required

What skills are needed for these roles? Picture the roles on a chart where the horizontal axis is the level of ML expertise required and the size of each bubble is the level of communication and technical writing required (the bigger, the better).

 | |
- The **MLOps/ML Platform Engineer** role is primarily a software engineering role, often filled from a standard software engineering pipeline.
- The **ML Engineer** requires a rare mix of ML and software engineering skills. This person is either an engineer with significant self-teaching or a science/engineering Ph.D. who works as a traditional software engineer after graduate school.
- The **ML Researcher** is an ML expert who usually has an MS or Ph.D. in Computer Science or Statistics or has completed an industrial fellowship program.
- The **ML Product Manager** is just like a traditional Product Manager, but with a deep knowledge of the ML development process and mindset.
- The **Data Scientist** role covers a wide range of backgrounds, from undergraduates to Ph.D. holders.

There is an important distinction between a task ML engineer and a platform ML engineer, coined by Shreya Shankar in [this blog post](https://www.shreya-shankar.com/phd-year-one/):

1. **Task ML engineers** are responsible for maintaining specific ML pipelines. They focus on ensuring that these ML models are healthy and updated frequently. They are often overburdened.
2. **Platform ML engineers** help task ML engineers automate the tedious parts of their jobs. They are called MLOps/ML Platform engineers in our parlance.

## 2 - Hiring

### The AI Talent Gap

In 2018 (when we started FSDL), the AI talent gap was the main story. There were so few people who understood the technology that the biggest blocker for organizations was simply finding people who were good at ML.

In 2022, the AI talent gap persists, but it tends to be less of a blocker than it used to be: we've had four years of people switching careers into ML and of software engineers emerging from undergrad with at least a couple of ML classes under their belts. The gap now is in people who understand more than just the underlying technology: those who have also seen how ML fails and know how to make it succeed once deployed. That's the reality of hiring ML folks today, especially those with **production experience**.

### Sourcing

Because of this shallow talent pool and skyrocketing demand, hiring for ML positions is pretty hard. Typical ML roles fall into the following structure:

- ML-adjacent roles: ML Product Manager, DevOps, Data Engineer
- Core ML roles: ML Engineer, ML Researcher/ML Scientist
- Business analytics roles: Data Scientist

For ML-adjacent roles, traditional ML knowledge is less important; demonstrated interest, conversational understanding, and related experience let these professionals play an impactful role on ML teams. Let's focus on how to hire for **the core ML roles**.

 | |
While there's no perfect way to **hire ML engineers**, there's definitely a wrong way: posting an extensive job description that demands only the best qualifications. There are plenty of real examples of this bad practice floating around.

- Rather than this unrealistic process, consider hiring for software engineering skills, an interest in ML, and a desire to learn. You can always train people in the art and science of ML, especially when they come with strong software engineering fundamentals.
- Another option is to add junior talent, as many recent grads come out with good ML knowledge nowadays.
- Finally, and most importantly, be specific about what you need the person in this position to do. It's impossible to find one person who can do everything from full-fledged DevOps to algorithm development.

To **hire ML researchers**, here are our tips:

- Evaluate the quality of publications over their quantity, with an eye toward the originality of the ideas, the execution, etc.
- Prioritize researchers who focus on important problems instead of trendy problems.
- Experience outside academia is also a positive, as these researchers may be able to transition to industry more effectively.
- Finally, keep an open mind about research talent and consider talented people without PhDs or from adjacent fields like physics, statistics, etc.

To find quality candidates for these roles, here are some ideas for sourcing:

- Use standard sources like LinkedIn, recruiters, on-campus recruiting, etc.
- Monitor arXiv and top conferences and flag the first authors of papers you like.
- Look for good implementations of papers you like.
- Attend ML research conferences (NeurIPS, ICML, ICLR).

 | |
As you seek to recruit, stay on top of what professionals want and make an effort to position your company accordingly. ML practitioners want to be empowered to do great work with interesting data. Building a culture of learning and impact can help recruit the best talent to your team. Additionally, sell, sell, sell! Talent needs to know how good your team is and how meaningful the mission can be.

### Interviewing

As you interview candidates for ML roles, try to **validate your hypotheses of their strengths while testing a minimum bar on weaker aspects**. For example, ensure ML researchers can think creatively about new ML problems while verifying they meet a baseline for code quality. It's essential to test both ML knowledge and software engineering skills for all industry professionals, though the relative emphasis can vary.

The actual ML interview process is much less well-defined than software engineering interviews, though it is modeled on them. Some helpful inclusions are projects or exercises that test the ability to work with ML-specific code, like take-home ML projects. Chip Huyen's "[Introduction to ML Interviews Book](https://huyenchip.com/ml-interviews-book/)" is a great resource.

### Finding A Job

To find an ML job, you can take a look at the following sources:

- Standard sources such as LinkedIn, recruiters, on-campus recruiting, etc.
- ML research conferences (NeurIPS, ICLR, ICML).
- Applying directly (remember, there's a talent gap!).

Standing out for competitive roles can be tricky! Here are some tips (in increasing order of impressiveness) that you can apply to differentiate yourself:

1. Exhibit ML interest (e.g., conference attendance, online course certificates, etc.).
2. Build software engineering skills (e.g., at a well-known software company).
3. Show you have broad knowledge of ML (e.g., write blog posts synthesizing a research area).
4. Demonstrate the ability to get ML projects done (e.g., create side projects, re-implement papers).
5. Prove you can think creatively in ML (e.g., win Kaggle competitions, publish papers).

## 3 - Organizations

### Organization Archetypes

There is not yet a consensus on the right way to structure an ML team. Still, a few best practices are contingent upon different organization archetypes and their ML maturity levels. First, let's see what the different ML organization archetypes are.

**Archetype 1 - Nascent and Ad-Hoc ML**

- These are organizations where no one is doing ML, or ML is done on an ad-hoc basis. Obviously, there is little ML expertise in-house.
- They are either small-to-medium businesses or less technology-forward large companies in industries like education or logistics.
- There is often low-hanging fruit for ML.
- But there is little support for ML projects, and it's challenging to hire and retain good talent.

**Archetype 2 - ML R&D**

- These are organizations in which ML efforts are centered in the R&D arm of the organization. They often hire ML researchers and doctorate students with experience publishing papers.
- They are larger companies in sectors such as oil and gas, manufacturing, or telecommunications.
- They can hire experienced researchers and work on long-term business priorities to get big wins.
- However, it is very difficult to get quality data, and this type of research work rarely translates into actual business value, so the amount of investment usually remains small.

**Archetype 3 - ML Embedded Into Business and Product Teams**

- These are organizations where certain product teams or business units have ML expertise alongside their software or analytics talent. These ML individuals report up to the team's engineering/tech lead.
- They are either software companies or financial services companies.
- ML improvements are likely to lead to business value. Furthermore, there is a tight feedback cycle between idea iteration and product improvement.
- Unfortunately, it is still very hard to hire and develop top talent, and access to data and compute resources can lag. There are also potential conflicts between ML project cycles and engineering management, so long-term ML projects can be hard to justify.

**Archetype 4 - Independent ML Function**

- These are organizations in which the ML division reports directly to senior leadership. The ML Product Managers work with Researchers and Engineers to build ML into client-facing products. They can sometimes publish long-term research.
- They are often large financial services companies.
- Talent density allows them to hire and train top practitioners, and senior leaders can marshal data and compute resources. This allows the organization to invest in tooling, practices, and culture around ML development.
- A disadvantage is that model handoffs to different business lines can be challenging, since users need to buy into the benefits of ML and be educated on how to use the models. Feedback cycles can also be slow.

**Archetype 5 - ML-First Organizations**

- These are organizations in which the CEO invests in ML and there are experts across the business focusing on quick wins. The ML division works on challenging, long-term projects.
- They are large tech companies and ML-focused startups.
- They have the best data access (data thinking permeates the organization), the most attractive recruiting funnel (challenging ML problems tend to attract top talent), and the easiest deployment procedure (product teams understand ML well enough).
- This archetype is hard to implement in practice, since it is culturally difficult to embed ML thinking everywhere.

### Team Structure Design Choices

Depending on which archetype your organization resembles, you can make the appropriate design choices, which broadly speaking fall into three categories:

1. **Software Engineering vs. Research**: To what extent is the ML team responsible for building or integrating with software? How important are software engineering skills on the team?
2. **Data Ownership**: How much control does the ML team have over data collection, warehousing, labeling, and pipelining?
3. **Model Ownership**: Is the ML team responsible for deploying models into production? Who maintains the deployed models?

Below are our design suggestions:

If your organization focuses on **ML R&D**:

- Research is prioritized over software engineering skills. Because of this, collaboration between the two groups can be lacking.
- The ML team has no control over the data and typically will not have data engineers to support it.
- ML models are rarely deployed into production.

If your organization has **ML embedded into the product**:

- Software engineering skills are prioritized over research skills. Often, the researchers need strong engineering skills, since everyone is expected to productionize their own models.
- ML teams generally do not own data production and data management. They need to work with data engineers to build data pipelines.
- ML engineers fully own the models that they deploy into production.

If your organization has **an independent ML division**:

- Each team has a potent mix of engineering and research skills, so the two groups work closely together within teams.
- The ML team has a voice in data governance discussions, as well as a robust data engineering function.
- The ML team hands off models to users but is still responsible for maintaining them.

If your organization is **ML-First**:

- Different teams are more or less research-oriented, but in general, research teams collaborate closely with engineering teams.
- The ML team often owns the company-wide data infrastructure.
- The ML team hands models off to users, who are responsible for operating and maintaining them.

## 4 - Managing

### Managing ML Teams Is Challenging

The process of actually managing an ML team is quite challenging for four reasons:

1. **Engineering Estimation:** It's hard to know in advance how easy or hard an ML project is. As you explore the data and experiment with different models, there is enormous scope for new learnings about the problem that materially impact the timeline. Furthermore, knowing what methods will work is often impossible. This makes it hard to say upfront how long or how much work an ML project will take.
2. **Nonlinear Progress:** As Lukas Biewald (CEO of [Weights and Biases](https://wandb.ai/site)) illustrates in a [blog post](https://medium.com/@l2k/why-are-machine-learning-projects-so-hard-to-manage-8e9b9cf49641), progress on ML projects is unpredictable over time, even when the effort expended grows considerably. It's very common for projects to stall for extended periods of time.

 | |
3. **Cultural Gaps:** The cultures of engineering and research professionals are very different. Research tends to favor novel, creative ideas, while engineering prefers tried-and-true methods that work. As a result, ML teams often experience a clash of cultures, which can turn toxic if not appropriately managed. A core challenge of running ML teams is addressing the cultural barriers between ML and software engineering so that teams can harmoniously experiment and deliver ML products.
4. **Leadership Deficits:** It's common to see a lack of detailed understanding of ML at senior levels of management in many companies. As a result, expressing feasibility and setting the right expectations for ML projects, especially high-priority ones, can be hard.

### How To Manage ML Teams Better

Managing ML teams is hardly a solved problem, but you can take steps to improve the process.

**Plan probabilistically**

Many engineering projects are managed in a waterfall fashion, with sequential tasks clearly defined up front. Instead of forcing this method of engineering management onto difficult ML projects, try assigning a likelihood of success to different tasks to better capture the experimental process inherent to ML engineering. As these tasks progress or stall, rapidly re-evaluate your task ordering to better match what is working. Having a sense of both (1) **how likely a task is to succeed** and (2) **how important it is** makes project planning considerably more realistic; a toy sketch of this idea follows.

 | |
**Have a portfolio of approaches**

Embrace multiple ideas and approaches to solve crucial research challenges that gate production ML. Don't make your plan dependent on one approach working!

**Measure inputs, not results**

As you work through the approaches in your portfolio, do not overly emphasize whose ideas ultimately work as a reflection of contribution quality. This can negatively impact team members' creativity, as they will focus on pursuing only what they currently think could work, rather than experimenting in a high-quality fashion (which is ultimately what leads to ML success).

**Have researchers and engineers work together**

Collaboration between engineering and research is essential for getting quality ML products into production. Emphasize collaboration across groups and professionals!

**Get quick wins**

Quick wins make it more likely that your ML project will succeed in the long term. They allow you to demonstrate progress to your leadership more effectively and clearly.

**Educate leadership on uncertainty**

This can be hard, as leadership is ultimately accountable for addressing blind spots and understanding timeline risk. There are things you can do, however, to improve leadership's knowledge about ML timelines.

- Avoid building hype around narrow progress metrics that matter only to the ML team (e.g., "*We improved F1 score by 0.2 and have achieved awesome performance!*").
- Instead, be realistic, communicate risk, and emphasize real product impact (e.g., "*Our model improvements should increase the number of conversions by 10%, though we must continue to validate its performance on additional demographic factors.*").
- Share resources like [this a16z primer](https://a16z.com/2016/06/10/ai-deep-learning-machines/), [this class from Prof. Pieter Abbeel](https://executive.berkeley.edu/programs/artificial-intelligence), and [Google's People + AI Guidebook](https://pair.withgoogle.com/guidebook) to increase your leadership's awareness of ML.

### ML PMs Are Well-Positioned to Educate the Organization

There are two types of ML product managers:

1. **Task PMs**: These are the more common form of ML PM. They generally specialize in a specific product area (e.g., trust and safety) and have a strong understanding of the particular use case.
2. **Platform PMs**: These are a newer form of PM. They have a broader mandate: to ensure that the ML team (generally centralized in this context) operates at the highest leverage. They manage workflow and priorities for this centralized team, and to support this, they tend to have a broad understanding of ML themselves. These PMs are critical for educating the rest of the company about ML and ensuring that teams trust the output of models.

Both types of PMs are crucial for ML success. Platform PMs tend to have a particularly powerful role to play in pushing an organization's adoption of machine learning and making it successful.

### What Is "Agile" for ML?

Two project management frameworks play a role in ML similar to what Agile plays in software development: **CRISP-DM** (the Cross-Industry Standard Process for Data Mining) and Microsoft's **TDSP** (Team Data Science Process).

 | |
Both are structured, data-science-native approaches to project management. You can use them to standardize project stages, roles, and artifacts.

**TDSP** tends to be more structured and is a strong alternative to the Agile methodology. **CRISP-DM** is somewhat higher-level and does not provide as structured a project management workflow. If you genuinely have a large-scale coordination problem, you can try these frameworks, but don't otherwise. They can slow you down, since they are oriented more around "traditional" data science than machine learning.

## 5 - Design

Now let's talk about how to actually design machine learning products. The biggest challenge with designing such products often isn't implementing them; it's **bridging the gap between users' inflated expectations and reality**.

Users often expect extremely sophisticated systems capable of solving many more problems than they actually can.

 | |
In reality, machine learning systems are more like dogs trained to do a special task: weird little guys with a penchant for distraction and an inability to do much more than they are explicitly told.

 | |
All this leads to a big gap between what can be done and what users expect!

### The Keys to Good ML Product Design

In practice, **good ML product design bridges users' expectations and reality**. If you can help users understand the benefits and limitations of the model, they tend to be more satisfied. Furthermore, always have backup plans for model failures! Over-automating systems tends to be a recipe for unhappy users. Finally, building in feedback loops can really increase satisfaction over time.

There are a couple of ways to **explain the benefits and limitations** of an ML system to users:

- Focus on the problems it solves, not on the fact that the system is "AI-powered".
- If you make the system feel "human-like" (unconstrained input, human-like responses), expect users to treat it as human-like.
- Furthermore, prefer guardrails or prescriptive interfaces over open-ended, human-like experiences. A good example of this approach is [Amazon Alexa](https://alexa.amazon.com/), which has specific prompts that its ML system responds to.

 | |
**Handling failures** is a key part of keeping users of ML systems happy. There's nothing worse than a "smart" system that conks out when you do something slightly unexpected. Building in solutions for automation failures is extremely important. One approach is letting users correct improper responses. Another is to focus on the notion of "model confidence" and only offer responses when a confidence threshold is met. A good example of the latter is how Facebook suggests photo tags for users but doesn't go so far as to auto-assign them.

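Here is a minimal sketch of that confidence-threshold pattern, assuming a scikit-learn-style classifier with `predict_proba` and `classes_` attributes; the threshold value is an assumption to be tuned per product.

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.9  # assumption: tune based on the cost of a wrong suggestion

def suggest_tag(model, features: np.ndarray):
    """Return a tag suggestion only when the model is confident enough;
    otherwise return None so the user tags manually (suggest, don't auto-assign)."""
    probs = model.predict_proba(features.reshape(1, -1))[0]  # scikit-learn-style API
    best = int(np.argmax(probs))
    if probs[best] >= CONFIDENCE_THRESHOLD:
        return {"tag": model.classes_[best], "confidence": float(probs[best])}
    return None  # fall back to manual tagging instead of a low-confidence guess
```
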
### Types of User Feedback

How can you collect feedback from users in a way that avoids these issues? Different types of user feedback help with model improvement in different ways.

 | |
Let's go through the types, from simplest to most valuable; a sketch of how such feedback might be recorded follows the list.

1. The simplest form of feedback is **indirect implicit feedback**. For example, did the user churn from the product? That tells you immediately how the user felt about the system without them giving a clear signal themselves.
2. Another form is **direct implicit feedback**, which involves the user "taking the next step". For example, in an automated user onboarding flow, did the user click through into ensuing steps? This is trickier to implement, but it can be useful for future training iterations.
3. The next type is **binary explicit feedback**, wherein users are specifically asked (e.g., via thumbs up/down buttons) how they feel about the model's performance.
4. You can make this more sophisticated with **categorical explicit feedback**, which allows users to sort their feedback into various types.
5. To really get a sense of how users feel, consider offering **free-text feedback**. This is tricky to use for model training and can be demanding for users, but it's very useful for highlighting the highest-friction predictions.
6. The gold standard, of course, is **model corrections**; they are free labels!

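As a minimal sketch of how these feedback types might be represented and logged so that corrections can later be joined back to predictions and reused as training labels; the type names and the `record_feedback` helper are hypothetical, not part of any particular product:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class FeedbackType(Enum):
    INDIRECT_IMPLICIT = auto()     # e.g., churn or session abandonment
    DIRECT_IMPLICIT = auto()       # e.g., user took the suggested next step
    BINARY_EXPLICIT = auto()       # thumbs up / thumbs down
    CATEGORICAL_EXPLICIT = auto()  # user-chosen feedback category
    FREE_TEXT = auto()             # open-ended comments
    MODEL_CORRECTION = auto()      # user fixed the output: a free label

@dataclass
class FeedbackEvent:
    prediction_id: str             # key back to the model's input and output
    kind: FeedbackType
    payload: Optional[str] = None  # label, category, or free text, if any

def record_feedback(store: list, event: FeedbackEvent) -> None:
    """Append feedback keyed to a prediction so corrections can be
    reused as labels in future training iterations."""
    store.append(event)

# Usage: a user corrects a photo-tag suggestion for prediction "pred-123".
log: list = []
record_feedback(log, FeedbackEvent("pred-123", FeedbackType.MODEL_CORRECTION, "golden retriever"))
```
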
Whenever you build explicit feedback into ML systems, avoid relying on users' altruism; be clear about why they should engage with the feedback mechanism, and build positive feedback loops by letting users experience the benefits of their feedback quickly.

**Great ML product experiences are designed from scratch**. ML is a very specific technology with clear advantages and drawbacks, and design needs to be thoughtfully executed around them. It's especially important to allow users to interact safely with ML products that may fail in unexpected ways, and to find ways to build in feedback loops that make the ML product better over time.

There are tons of resources that can help you get started in this emerging field:

- [Google's People + AI Guidebook](https://pair.withgoogle.com/guidebook)
- [Guidelines for Human-AI Interaction](https://dl.acm.org/doi/abs/10.1145/3290605.3300233)
- [Agency Plus Automation: Designing AI into Interactive Systems](http://idl.cs.washington.edu/files/2019-AgencyPlusAutomation-PNAS.pdf)
- [Designing Collaborative AI](https://medium.com/@Ben_Reinhardt/designing-collaborative-ai-5c1e8dbc8810)

In conclusion, we talked through a number of adjacent considerations for building ML systems and products. In short, you ship the team as much as you ship the code; be as thoughtful about how you hire, manage, and structure ML teams as you are about ML products!
 | |