Optimizing Convolutional Neural Networks with Mojo - Part 1

Community blog post
Published September 20, 2023

When Modular announced Mojo, it made some bold claims, including speedups of up to 68,000x over Python. In the world of deep learning and artificial intelligence, Convolutional Neural Networks (CNNs) have risen to prominence as a powerful tool for various tasks, particularly in image and signal processing. In this blog series, we will look at what CNNs are, how far we can optimize them with Mojo, how much speedup we can squeeze out over Python, and why CNNs are a compelling choice from a compute perspective.

What are Convolutional Neural Networks (CNNs)?

Convolutional Neural Networks, or CNNs for short, are a class of deep neural networks designed to process structured grid data efficiently. They have achieved remarkable success in tasks such as image classification, object detection, and even natural language processing. CNNs are especially well-suited for tasks that involve grid-like data, such as images, where spatial relationships matter.

The key components of CNNs include:

Convolutional Layers: These layers apply convolutional filters (kernels) to the input data. Convolution involves sliding the kernel over the input, performing element-wise multiplications and summing the results. This operation captures local patterns and features in the data.

Pooling Layers: Pooling layers downsample the feature maps produced by convolutional layers. Common pooling methods include max pooling and average pooling, which help reduce the spatial dimensions while retaining important information.
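As a concrete sketch of this idea, here is 2x2 max pooling with stride 2 written in plain Python, in the same dependency-free style as the convolution example later in this post (the window size, stride, and sample values are illustrative):

```python
def max_pool(feature_map, pool_size=2, stride=2):
    # Output dimensions shrink by the pooling window and stride
    out_h = (len(feature_map) - pool_size) // stride + 1
    out_w = (len(feature_map[0]) - pool_size) // stride + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the largest value inside each pooling window
            output[i][j] = max(
                feature_map[i * stride + r][j * stride + c]
                for r in range(pool_size)
                for c in range(pool_size)
            )
    return output

feature_map = [[1, 3, 2, 4],
               [5, 6, 7, 8],
               [9, 2, 1, 0],
               [3, 4, 5, 6]]
print(max_pool(feature_map))  # [[6, 8], [9, 6]]
```

Note how a 4x4 feature map collapses to 2x2 while each window's strongest activation survives.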

Fully Connected Layers: These layers connect all neurons in one layer to all neurons in the next layer, similar to traditional neural networks. Fully connected layers are often used in the final layers of the network for classification or regression tasks.
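A minimal forward pass through a fully connected (dense) layer, again in plain Python, might look like this (the weights, biases, and inputs below are made up for illustration):

```python
def dense_forward(inputs, weights, biases):
    # Each output neuron is the dot product of the inputs with that
    # neuron's weight row, plus its bias
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

inputs = [1.0, 2.0, 3.0]
weights = [[0.1, 0.2, 0.3],   # neuron 1
           [0.4, 0.5, 0.6]]   # neuron 2
biases = [0.5, -0.5]
print(dense_forward(inputs, weights, biases))
```

Every input connects to every neuron, which is exactly why these layers dominate the parameter count of a network.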

Activation Functions: Activation functions, such as ReLU (Rectified Linear Unit), introduce non-linearity to the network, allowing it to learn complex relationships in the data.
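ReLU in particular is just max(0, x) applied element-wise; a quick sketch over a 2D feature map:

```python
def relu(feature_map):
    # Replace every negative value with 0, leave the rest unchanged
    return [[max(0, v) for v in row] for row in feature_map]

print(relu([[-2, 3], [0, -1]]))  # [[0, 3], [0, 0]]
```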

Now, let's explore why CNNs are an excellent choice from a compute perspective.

Why Choose CNNs from a Compute Perspective?

  1. Parallelism: CNNs are highly parallelizable. Convolution operations can be performed independently for different regions of the input, making them ideal for parallel computing on GPUs and other specialized hardware. This parallelism significantly accelerates training and inference times.

  2. Vectorization: Modern CPUs and GPUs are equipped with vectorized instructions (SIMD and SIMT, respectively), which allow for efficient element-wise operations on data. CNN operations like element-wise multiplications and activations can be optimized for vectorization, resulting in substantial speedups.

  3. Localized Computations: CNNs focus on local patterns and features, which reduces the computational complexity compared to fully connected networks. This localized approach minimizes the number of parameters and computations required, making CNNs more computationally efficient.

  4. Hierarchical Feature Learning: CNNs employ a hierarchical architecture where lower layers capture simple features like edges and textures, while higher layers learn complex patterns and objects. This hierarchy reduces the need for exhaustive computations at every layer, improving efficiency.

  5. Transfer Learning: CNNs can leverage pre-trained models on large datasets. This transfer learning approach allows you to fine-tune models for specific tasks, saving both training time and computational resources.
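To make point 2 above concrete, here is the same element-wise operation written as an interpreted Python loop and as a single vectorized call. NumPy is used here purely to illustrate the vectorization idea (the rest of this post stays dependency-free, and Mojo exposes SIMD types directly rather than going through NumPy):

```python
import numpy as np

# A small feature map; values are arbitrary
feature_map = np.arange(12, dtype=np.float64).reshape(3, 4)

# Interpreted loop: one multiply per Python-level iteration
scaled_loop = [[v * 2.0 for v in row] for row in feature_map.tolist()]

# Vectorized: a single call dispatched to SIMD-friendly compiled loops
scaled_vec = feature_map * 2.0

print(np.allclose(scaled_loop, scaled_vec))  # True
```

Both produce identical results; the difference is how many values each "step" of the computation touches.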
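Point 3 above is easy to quantify with a back-of-the-envelope parameter count (the filter and layer sizes below are illustrative choices, not from any particular network):

```python
# A 3x3 convolution with 1 input channel and 32 filters shares its
# weights across every spatial position, so the parameter count is tiny:
conv_params = 32 * (3 * 3 * 1 + 1)   # 9 weights + 1 bias per filter

# A fully connected layer mapping a flattened 28x28 image to 32 units
# needs a separate weight for every pixel-unit pair:
fc_params = 32 * (28 * 28 + 1)       # 784 weights + 1 bias per unit

print(conv_params)  # 320
print(fc_params)    # 25120
```

Weight sharing buys roughly a 78x reduction in parameters here, and the gap only widens with larger inputs.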

Now, let's implement a simple convolution operation from scratch in both Python and Mojo, without using any external libraries. We'll start with the Python version:

# Example 3x3 kernel (a simple vertical edge detector)
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

# Define a convolution operation
def convolution(input_data):
    input_height, input_width = len(input_data), len(input_data[0])
    kernel_height, kernel_width = len(kernel), len(kernel[0])
    output_height = input_height - kernel_height + 1
    output_width = input_width - kernel_width + 1
    output = [[0 for _ in range(output_width)] for _ in range(output_height)]
    for i in range(output_height):
        for j in range(output_width):
            output[i][j] = sum(
                input_data[i + k][j + l] * kernel[k][l]
                for k in range(kernel_height)
                for l in range(kernel_width)
            )
    return output

# Example input data (5x5)
input_data = [[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10],
              [11, 12, 13, 14, 15],
              [16, 17, 18, 19, 20],
              [21, 22, 23, 24, 25]]

# Perform convolution
result = convolution(input_data)
for row in result:
    print(row)

This code demonstrates a basic convolution operation on a 5x5 input matrix using a 3x3 kernel. It's a simplified example to illustrate the concept. In practice, libraries like TensorFlow or PyTorch are used for building and training CNNs due to their efficiency and flexibility.


Convolutional Neural Networks are a compelling choice from a compute perspective due to their parallelism, vectorization capabilities, localized computations, hierarchical feature learning, and potential for transfer learning. These properties make CNNs highly efficient for various tasks, especially those involving structured grid data like images. As we continue through this blog series, we will measure the speedup Mojo brings to this CNN over Python, and see how we can optimize it further in Mojo.