Building Your First Kubeflow Pipeline: A Comprehensive Guide

Community Article Published October 15, 2023

Kubeflow is an open-source platform designed to be end-to-end, facilitating each step of the Machine Learning (ML) workflow. It aims to make deployments of ML workflows on Kubernetes simple, portable, and scalable. One of its most powerful features is Kubeflow Pipelines, a platform for building, deploying, and managing ML workflows based on Docker containers.

Why should you care? The power of Kubeflow Pipelines lies in their ability to automate and streamline the entire machine learning process, from data ingestion to model deployment. This not only saves time but also helps to maintain consistency and quality in your projects.

In this post, we'll explore how to build your first Kubeflow Pipeline from scratch. By the end, you'll have a solid understanding of what Kubeflow is and how you can use it to construct an ML workflow.


Kubeflow and Machine Learning Workflows

Kubeflow is a platform that serves both data scientists and machine learning engineers. Data scientists can use Kubeflow to experiment with ML models and orchestrate those experiments efficiently on Kubernetes. Machine learning engineers can use Kubeflow to deploy ML systems to various environments for development, testing, and production serving. The diagram below illustrates two distinct phases in a machine learning project: (i) the Experimental Phase and (ii) the Production Phase.

[Figure: Kubeflow components in the ML workflow]

Kubeflow provides components that support nearly every step of this workflow. For example, for tuning a model's hyperparameters, Kubeflow offers a component called Katib.

Kubeflow also aligns well with MLOps principles, which aim to bridge the gap between machine learning and operations. By offering a unified workflow, Kubeflow makes it easier to manage ML projects from experimentation to production, incorporating aspects of DevOps and facilitating collaboration between data scientists and operations personnel.

Three Principles of Kubeflow

  • Composability: Kubeflow is highly composable, allowing you to use different versions of TensorFlow or any other ML libraries for different parts of your machine learning pipeline if needed.

  • Portability: With Kubeflow, you can run your entire machine learning project anywhere you are running Kubernetes. It abstracts away platform-specific details, enabling you to write your code once and run it whether you're on your laptop or a cloud-based cluster.

  • Scalability: Kubeflow is designed to scale, providing you with the flexibility to allocate more resources when they're needed and release them when they're not. Whether you're using CPUs, GPUs, or TPUs, Kubeflow helps you make the most of your hardware resources.

[Figure: Kubeflow Conceptual Entities]

Installing Kubeflow

There are two ways to get started with Kubeflow:

  1. Install it using a packaged distribution, which is simple and straightforward. More information on installing Kubeflow with packaged distributions is available in the Kubeflow documentation.
  2. Install it using raw manifests, which is a more advanced approach. Detailed instructions are likewise available in the Kubeflow documentation.

Packaged distributions are developed and supported by the respective maintainers. For example, Microsoft maintains Kubeflow on Azure. You can find a complete list of distributions in the table below:

[Table: Kubeflow Packaged Distributions]

You can also refer to the blog titled Kubeflow: How to Install and Launch Kubeflow on your Local Machine for more detailed installation instructions.

After installing, you will have access to the Kubeflow Dashboard, which resembles the figure below.

[Figure: Kubeflow Dashboard UI View]

A Simple Python Script for Demonstration

Before we get into Kubeflow Pipelines, let's create a simple Python script to understand what we're aiming to convert into a pipeline. This script will simulate a very basic ML workflow where we read some data, perform a trivial computation, and save the result.

Here is how you can do it:

# my_script.py

def read_data():
    # Return a small, hard-coded dataset
    data = [1, 2, 3, 4, 5]
    return data

def compute_average(data):
    # Compute the arithmetic mean of the data
    return sum(data) / len(data)

def save_result(value, filename='result.txt'):
    # Write the result to a text file
    with open(filename, 'w') as f:
        f.write(str(value))

if __name__ == "__main__":
    data = read_data()
    avg = compute_average(data)
    save_result(avg)

This Python script has three functions: one for reading data (read_data), one for computing the average (compute_average), and one for saving the result (save_result). Our goal will be to convert each of these functions into a Kubeflow pipeline component.

In the next section, we'll delve into Kubeflow Components and show you how to build one from this simple Python script.

Understanding Kubeflow Components

Kubeflow Components are the building blocks of a pipeline. Essentially, each component is a self-contained piece of code that performs one step in your ML workflow. It runs independently and does one thing well, like read data, transform features, train a model, or serve an endpoint.

Let's convert our simple Python script into Kubeflow Components. We'll be using the Kubeflow Pipelines SDK's components module, specifically its func_to_container_op helper, to do this.

Creating a Kubeflow Component

First, let's turn each function in our Python script into a separate component.

from kfp import components

def read_data() -> list:
    data = [1, 2, 3, 4, 5]
    return data

def compute_average(data: list) -> float:
    return sum(data) / len(data)

def save_result(value: float, filename: str = 'result.txt'):
    with open(filename, 'w') as f:
        f.write(str(value))

# Convert each function into a pipeline component
# (this also writes a reusable component specification to a YAML file)
read_data_op = components.func_to_container_op(func=read_data,
                                output_component_file='read_data_component.yaml',
                                base_image='python:3.7',  # You can specify the base image here
                                packages_to_install=[])  # Any packages that need to be installed can be added here

compute_average_op = components.func_to_container_op(func=compute_average,
                                output_component_file='compute_average_component.yaml',
                                base_image='python:3.7',
                                packages_to_install=[])

save_result_op = components.func_to_container_op(func=save_result,
                                output_component_file='save_result_component.yaml',
                                base_image='python:3.7',
                                packages_to_install=[])

Components are the building blocks of a Kubeflow Pipeline. In our example, we used the func_to_container_op function to convert a Python function into a component. While doing so, you may have noticed two parameters, base_image and packages_to_install.

The base_image parameter specifies the Docker image that will be used as the execution environment for the component. In simpler terms, it's like the operating system of the component. This image should have all the necessary software to run your code.

  • Why is it Important?: Different codebases may require different runtime environments. For example, if you are working on a TensorFlow project, you may choose a base image that has TensorFlow pre-installed.

  • Example Usage:

    base_image='tensorflow/tensorflow:2.4.0'
    
  • Default: If you don't specify a base_image, the default Python image is used, which is a minimal image with Python installed.

The packages_to_install parameter lets you install additional Python packages needed for your code to run. It takes a list of package names that are installed with pip when the component runs.

  • Why is it Important?: Sometimes your code depends on third-party libraries that are not present in the base image. In such cases, you can provide the names of these packages, and they will be installed when the component runs.

  • Example Usage:

    packages_to_install=['numpy', 'pandas']
    
  • Note: Be careful when specifying packages, as installing incompatible versions can break your component.
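To see both parameters together, here is a minimal sketch of a hypothetical component that depends on numpy. The compute_stats function below is illustrative and not part of our earlier script; note that lightweight components must import their dependencies inside the function body.

from kfp import components

# Hypothetical component that needs numpy at runtime
def compute_stats(data: list) -> float:
    import numpy as np  # lightweight components import dependencies inside the function
    return float(np.std(data))

# numpy is not in the base image, so Kubeflow pip-installs it when the component runs
compute_stats_op = components.func_to_container_op(
    func=compute_stats,
    base_image='python:3.7',
    packages_to_install=['numpy'])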

Creating the Kubeflow Pipeline

Having defined our components, the next step is to arrange them into a pipeline. To do this, you'll use Kubeflow's Domain Specific Language (DSL) to link components together. Once you have your components compiled and saved as .yaml files, you're ready to assemble them into a pipeline. For this, we'll write a Python function that uses the Kubeflow Pipelines SDK to define the pipeline's structure.

In Kubeflow Pipelines, a pipeline is essentially a Python function decorated with @dsl.pipeline. Within this function, you can use the components as building blocks. Here's how you can define a Kubeflow Pipeline using our compiled components.

Importing Compiled Components

You can import your compiled components like this:

import kfp
from kfp import dsl

read_data_op = kfp.components.load_component_from_file('read_data_component.yaml')
compute_average_op = kfp.components.load_component_from_file('compute_average_component.yaml')
save_result_op = kfp.components.load_component_from_file('save_result_component.yaml')

Assembling the Pipeline

After loading the components, let's stitch them together to form a pipeline.

@dsl.pipeline(
    name='My first pipeline',
    description='A simple pipeline that computes the average of an array.'
)
def my_pipeline():
    read_data_task = read_data_op()
    compute_average_task = compute_average_op(read_data_task.output)
    save_result_task = save_result_op(compute_average_task.output)

# Compile the pipeline
kfp.compiler.Compiler().compile(my_pipeline, 'my_pipeline.yaml')

This pipeline first reads data using read_data_op, then computes the average using compute_average_op, and finally saves the result using save_result_op.

Additionally, here's a snippet that shows how to pass parameters between components:

@dsl.pipeline(
    name='My parameterized pipeline',
    description='A simple pipeline that reads data and takes a parameter.'
)
def my_pipeline(my_param: int):
    read_data_task = read_data_op()
    another_task = another_component_op(my_param)  # another_component_op is a placeholder for any component that accepts a parameter

This allows you to create more dynamic and configurable pipelines.
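Pipeline parameters such as my_param are supplied when you start a run, either through the Kubeflow UI or programmatically. Here is a rough sketch using the KFP SDK client; the host URL is a placeholder assumption for your Kubeflow Pipelines endpoint.

import kfp

# Connect to the Kubeflow Pipelines API server (placeholder host)
client = kfp.Client(host='http://localhost:8080')

# Compile and submit the parameterized pipeline in one call;
# keys in `arguments` map to the pipeline function's parameters
client.create_run_from_pipeline_func(
    my_pipeline,
    arguments={'my_param': 42})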

In the following sections, we'll look into how to deploy this pipeline and best practices to follow while working with Kubeflow Pipelines.

Deploying the Kubeflow Pipeline

After constructing and compiling your pipeline, the next step is to deploy it onto the Kubeflow platform. This involves uploading the compiled .yaml file and then running the pipeline.

Uploading the Pipeline

  1. Navigate to your Kubeflow dashboard.
  2. Go to the Pipelines section.
  3. Click on Upload Pipeline.
  4. Browse and select the my_pipeline.yaml file.

Once uploaded, you'll see your pipeline listed among any others you may have uploaded.

Running the Pipeline

  1. Click on the pipeline you've just uploaded.
  2. Hit the Run button.
  3. Give your run a name and click Start.

Now you can monitor the pipeline as it progresses through each stage. Successful execution will indicate that your pipeline has been deployed correctly.

[Figure: Example Kubeflow Pipeline View]
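If you prefer scripting over clicking through the UI, the same upload-and-run steps can be performed with the KFP SDK client. This is a sketch under the assumption that your Pipelines endpoint is reachable at the placeholder host below.

import kfp

client = kfp.Client(host='http://localhost:8080')  # placeholder endpoint

# Equivalent of "Upload Pipeline" in the UI
client.upload_pipeline(pipeline_package_path='my_pipeline.yaml',
                       pipeline_name='My first pipeline')

# Equivalent of clicking "Run": start a run directly from the compiled package
client.create_run_from_pipeline_package(pipeline_file='my_pipeline.yaml',
                                        arguments={},
                                        run_name='my-first-run')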

Best Practices

When you are working with Kubeflow Pipelines, certain best practices can help you make the most out of the platform. Below are some guidelines to consider for a smoother experience:

Version Control Components

  • Ensure that every version of your component is well-documented and version-controlled. This will make it easier to debug and update your pipelines in the future.

Error Handling

  • Make sure to include error-handling mechanisms in your components. This can be done by catching exceptions in the Python code and logging meaningful messages.
# Example of a Python function with error handling
def read_data() -> list:
    try:
        # code to read data
        data = [1, 2, 3, 4, 5]
        return data
    except Exception as e:
        # Log a meaningful message and fall back to an empty dataset
        print(f"An error occurred: {e}")
        return []

Dependency Management

  • Specify all dependencies explicitly, either in your Dockerfile or as part of your component's metadata; for example, pin exact versions in packages_to_install, as sketched below.
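A minimal sketch of version pinning; the preprocess_data function and the version numbers are illustrative assumptions.

from kfp import components

# Hypothetical preprocessing step used only to illustrate dependency pinning
def preprocess_data(data: list) -> list:
    import pandas as pd  # installed at component runtime via packages_to_install
    return pd.Series(data).dropna().tolist()

# Pinning exact versions keeps the component reproducible across runs
preprocess_op = components.func_to_container_op(
    func=preprocess_data,
    base_image='python:3.7',
    packages_to_install=['pandas==1.3.5'])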

Component Reusability

  • Design components to be reusable so that you can plug them into different pipelines as needed.

Monitor Resources

  • Keep an eye on the resources (CPU, memory, etc.) that your pipeline uses, and optimize components to be as efficient as possible; the sketch below shows per-task resource settings.
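A minimal sketch of per-task resource requests and limits, assuming KFP v1-style tasks and reusing the components loaded earlier; the values themselves are placeholders.

from kfp import dsl

@dsl.pipeline(name='Resource-aware pipeline')
def resource_aware_pipeline():
    read_data_task = read_data_op()
    compute_average_task = compute_average_op(read_data_task.output)

    # Ask the scheduler for modest resources and cap usage per task
    compute_average_task.set_cpu_request('500m')
    compute_average_task.set_cpu_limit('1')
    compute_average_task.set_memory_request('256Mi')
    compute_average_task.set_memory_limit('1Gi')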

In summary, Kubeflow Pipelines offer a streamlined way to take your ML project from a simple script to a robust, end-to-end workflow.

Extending the Basics to Real-World ML Projects

So far, our examples have been extremely basic for the sake of clarity. However, don't underestimate the power of Kubeflow Pipelines; the principles we've covered scale impressively to real-world machine learning projects.

Real-World Use Cases

Each component you define can represent a step in your typical machine learning workflow. Here's how you can map Kubeflow components to your machine learning project:

  1. Data Collection: The read_data component can be expanded to collect data from various sources like databases, Excel files, or APIs.

  2. Preprocessing: You can have another component for data cleaning and preprocessing, transforming the raw data into a format that can be fed into ML models.

  3. Data Splitting: A component could be used for splitting the dataset into training, validation, and test sets.

  4. Model Training: Here, you can utilize a component to train a model using the preprocessed training set.

  5. Evaluation: Lastly, a component can evaluate the model using various metrics to understand how well it performs.

Example

Let's say you have Python functions for each of these steps:

  • read_data()
  • preprocess_data()
  • train_test_split()
  • train_model()
  • evaluate_model()

Each of these functions can be converted to a Kubeflow component using func_to_container_op. Once they are components, you can arrange them in a pipeline just like we did with our simple read_data and compute_average components. This enables you to automate the entire machine learning workflow!
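As a rough sketch, assuming each function above has already been wrapped with func_to_container_op into a corresponding *_op factory, and that the split component exposes named outputs (the output names below are illustrative):

from kfp import dsl

@dsl.pipeline(
    name='End-to-end ML pipeline',
    description='Sketch of a full workflow built from the components above.'
)
def ml_pipeline():
    raw = read_data_op()
    clean = preprocess_data_op(raw.output)
    split = train_test_split_op(clean.output)            # assumed to expose named outputs
    model = train_model_op(split.outputs['train_set'])   # 'train_set' / 'test_set' are illustrative names
    evaluate_model_op(model.output, split.outputs['test_set'])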


And that's a wrap! Hopefully, you now have a solid foundation to start building your own Kubeflow Pipelines, whether it's for simple tasks or complex machine learning workflows. Remember, the sky's the limit when it comes to what you can achieve with Kubeflow!

We've covered a lot of ground in this post—from setting up your environment to building and deploying your very first Kubeflow Pipeline. By now, you should have a solid understanding of what Kubeflow is, what Kubeflow Pipelines are, and how they fit into the bigger picture of Machine Learning workflows.

Kubeflow Pipelines are an essential tool for automating and scaling your ML workflows. As you dive deeper into ML projects, the ability to create robust, scalable pipelines will become increasingly valuable.

For more information and further study, feel free to visit the Kubeflow official documentation.


Thank you for reading! Feel free to share this guide and spread the knowledge.

Happy Learning! 🚀


If you have any questions or want to contact me, all my social media accounts are on this link.

You can also follow my other blog posts on my website.