Hearing is Believing: Revolutionizing AI with Audio Classification via Computer Vision

Community Article · Published October 22, 2023


Introduction

In the realm of artificial intelligence and machine learning, we often hear about computer vision as a powerful tool for processing and understanding visual data. But what if we told you that computer vision can do more than just "see" things? Imagine a world where computers can listen, understand, and categorize sounds. Welcome to the fascinating domain of audio classification with computer vision. In this article, we will explore this remarkable synergy of sight and sound, delving into how computer vision is expanding its horizons to include the auditory realm.

The Convergence of Visual and Auditory Intelligence

Traditionally, computer vision focuses on analyzing and interpreting visual data. It has powered applications like facial recognition, object detection, and image segmentation, transforming industries from healthcare to autonomous vehicles. But in a world where data is increasingly multimodal, the ability to incorporate audio data into the mix is a game-changer.

Audio classification with computer vision marries the two domains by applying the principles of visual understanding to audio data. It leverages deep learning techniques to "see" sound, just as it would with images. This innovation has opened doors to a plethora of applications that were once out of reach.


Applications of Audio Classification with Computer Vision

  1. Environmental Sound Analysis: One of the most impactful applications of audio classification is in environmental sound analysis. Computer vision algorithms can process audio data to detect and classify sounds like sirens, alarms, or animal calls. This is invaluable in fields such as urban planning, wildlife conservation, and public safety.

  2. Healthcare: In the medical field, audio classification with computer vision can be used for diagnosing respiratory conditions. It can identify irregular breathing patterns or coughing sounds, enabling early intervention for patients.

  3. Content Moderation: Online platforms and social media networks can use audio classification to detect hate speech, profanity, or inappropriate content in audio messages or live streams. This ensures a safer online environment.

  4. Voice Command Recognition: By integrating audio classification with computer vision, we can build more robust voice command recognition systems. These systems become smarter by recognizing the context in which commands are given.


The Technology Behind Audio Classification

The core technology behind audio classification with computer vision is deep learning. Convolutional Neural Networks (CNNs), a class of deep neural networks primarily used for image analysis, are adapted to process spectrograms or other visual representations of sound. Spectrograms are 2D representations of audio, where time is represented on the x-axis and frequency on the y-axis. By converting audio data into spectrograms, CNNs can be applied to analyze and classify the audio, just as they do with images.
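
As a quick illustration of this conversion, here is a minimal sketch that turns a single WAV file into a spectrogram image using SciPy and Matplotlib. The file name is a placeholder and a mono clip is assumed; the walkthrough later in this article uses a custom STFT instead.

# A minimal sketch: render one WAV clip as a spectrogram image.
# 'example.wav' is a placeholder path, not a file from this article; a mono clip is assumed.
import numpy as np
import matplotlib.pyplot as plt
import scipy.io.wavfile as wav
from scipy import signal

samplerate, samples = wav.read('example.wav')            # waveform and its sample rate
f, t, Sxx = signal.spectrogram(samples, fs=samplerate)   # power matrix: frequency x time
plt.pcolormesh(t, f, 10 * np.log10(Sxx + 1e-10))         # power to decibels (epsilon avoids log(0))
plt.xlabel('time (s)')
plt.ylabel('frequency (Hz)')
plt.savefig('example_spectrogram.png', bbox_inches='tight')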


Data Preprocessing

Before we dive into model development, let's walk through the critical steps involved in audio classification with computer vision. Data preprocessing is the first step in making sense of audio data: raw audio is converted into a visual format that deep learning models can understand, namely spectrograms. These are essentially 2D images that represent sound over time.

Visualize the Dataset

The next step is to visualize the dataset. By converting audio data into spectrograms, we can gain insights into the structure of the data. Visualizations can reveal patterns, trends, and potential challenges in the dataset, helping us make informed decisions about data preprocessing and model selection.

Train-Test-Split

With the dataset prepared and visualized, the next step is to split it into training and testing sets. This division is crucial for training the model on one set and evaluating its performance on another. Cross-validation techniques can also be applied to ensure robust model assessment.

Prepare Validation Data

Validation data is essential for fine-tuning the model and preventing overfitting. This data is not used during the training process but helps monitor the model's performance and make necessary adjustments.

Model Training

Model training is where the magic happens. Deep learning models, typically convolutional neural networks (CNNs), are trained on the spectrogram data. The model learns to recognize patterns and unique "fingerprints" of different audio classes. This is where the model becomes proficient at audio classification.

Model Predictions

Once the model is trained, it can make predictions on new, unseen audio data. It can accurately categorize sounds into predefined classes, enabling real-time classification and decision-making.

Code Implementation: Bringing Sound to Sight

In this section, we'll delve into the nuts and bolts of implementing audio classification with computer vision. We'll walk through the key steps involved in preprocessing the data, training a deep learning model, and making predictions. While the following code is a simplified example, it provides a foundation for understanding the process.


Step 1: Install and Import Libraries

Before any audio can be converted into spectrograms, install the required libraries and clone the ESC-50 dataset:

!pip install -q datasets git+https://github.com/huggingface/transformers split-folders ultralytics
# let's get the ESC-50 dataset
!git clone https://github.com/karolpiczak/ESC-50.git

# import the libraries
import numpy as np
from matplotlib import pyplot as plt
from numpy.lib import stride_tricks
import os
import pandas as pd
import scipy.io.wavfile as wav

Step 2: Visualize the Dataset

Visualizing a dataset is an essential step in data analysis and machine learning. It helps you gain insights into the data, understand its structure, and identify potential patterns or anomalies. In the context of audio classification with computer vision, visualization looks a bit different from traditional numerical data: instead of plotting raw values, we plot spectrograms.
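
Note that the plotting function below calls two helpers, fourier_transformation and make_logscale, which are not defined in this article. Based on how they are used, they are assumed to compute a short-time Fourier transform and to rescale its frequency bins logarithmically; here is a minimal sketch under that assumption (it relies on the numpy, matplotlib, and stride_tricks imports from Step 1):

# Assumed helpers (not defined in the original article): a short-time Fourier transform
# and a logarithmic rescaling of its frequency axis, matching the calls below.

def fourier_transformation(sig, frame_size, overlap_fac=0.5, window=np.hanning):
    # short-time Fourier transform: split the signal into overlapping, windowed frames
    win = window(frame_size)
    hop_size = int(frame_size - np.floor(overlap_fac * frame_size))
    # zero-pad so the first frame is centred on the start of the signal
    samples = np.append(np.zeros(int(np.floor(frame_size / 2.0))), sig)
    cols = int(np.ceil((len(samples) - frame_size) / float(hop_size)) + 1)
    samples = np.append(samples, np.zeros(frame_size))
    frames = stride_tricks.as_strided(
        samples,
        shape=(cols, frame_size),
        strides=(samples.strides[0] * hop_size, samples.strides[0])).copy()
    frames *= win
    return np.fft.rfft(frames)

def make_logscale(spec, sr=44100, factor=20.):
    # merge linear FFT bins into log-spaced bins and return their centre frequencies
    timebins, freqbins = np.shape(spec)
    scale = np.linspace(0, 1, freqbins) ** factor
    scale *= (freqbins - 1) / max(scale)
    scale = np.unique(np.round(scale))
    newspec = np.zeros([timebins, len(scale)], dtype=np.complex128)
    for i in range(0, len(scale)):
        if i == len(scale) - 1:
            newspec[:, i] = np.sum(spec[:, int(scale[i]):], axis=1)
        else:
            newspec[:, i] = np.sum(spec[:, int(scale[i]):int(scale[i + 1])], axis=1)
    # centre frequency of each new bin, used for the y-axis labels
    allfreqs = np.abs(np.fft.fftfreq(freqbins * 2, 1. / sr)[:freqbins + 1])
    freqs = []
    for i in range(0, len(scale)):
        if i == len(scale) - 1:
            freqs += [np.mean(allfreqs[int(scale[i]):])]
        else:
            freqs += [np.mean(allfreqs[int(scale[i]):int(scale[i + 1])])]
    return newspec, freqs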

esc50_df = pd.read_csv('/content/ESC-50/meta/esc50.csv')
esc50_df.head()
def plot_spectrogram(location, category, plotpath=None, binsize=2**10, colormap="jet"):
    samplerate, samples = wav.read(location)

    s = fourier_transformation(samples, binsize)

    sshow, freq = make_logscale(s, factor=1.0, sr=samplerate)

    ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel

    timebins, freqbins = np.shape(ims)

    print("timebins: ", timebins)
    print("freqbins: ", freqbins)

    plt.figure(figsize=(15, 7.5))
    plt.title('Class Label: ' + category)
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
    plt.colorbar()

    plt.xlabel("time (s)")
    plt.ylabel("frequency (hz)")
    plt.xlim([0, timebins-1])
    plt.ylim([0, freqbins])

    xlocs = np.float32(np.linspace(0, timebins-1, 5))
    plt.xticks(xlocs, ["%.02f" % l for l in ((xlocs*len(samples)/timebins)+(0.5*binsize))/samplerate])
    ylocs = np.int16(np.round(np.linspace(0, freqbins-1, 10)))
    plt.yticks(ylocs, ["%.02f" % freq[i] for i in ylocs])

    if plotpath:
        plt.savefig(plotpath, bbox_inches="tight")
    else:
        plt.show()

    plt.clf()

    return ims
plot = plot_spectrogram('/content/ESC-50/audio/' + esc50_df[esc50_df['category'] == 'crow']['filename'].iloc[0], category='Crow')

[Output: spectrogram plot for the 'crow' clip]

Step 3: Data Processing

conversion = []

# preview the source audio path and the target spectrogram path for every clip
for i in range(len(esc50_df.index)):

    filename = esc50_df['filename'].iloc[i]
    location = '/content/ESC-50/audio/' + filename
    category = esc50_df['category'].iloc[i]
    catpath = '/content/ESC-50/spectrogram/' + category
    filepath = catpath + '/' + filename[:-4] + '.jpg'

    conversion.append({location, filepath})
conversion[0]
{'/content/ESC-50/audio/1-100032-A-0.wav',
 '/content/ESC-50/spectrogram/dog/1-100032-A-0.jpg'}
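
The conversion loop below relies on an audio_vis helper that is not defined in the article. It presumably renders each clip's spectrogram to an image file without axes, much like the test_vis function shown in Step 5; here is a minimal sketch under that assumption, reusing the fourier_transformation and make_logscale helpers sketched in Step 2:

# Assumed helper (not defined in the original article): render one clip's spectrogram
# to a bare image file, mirroring the test_vis function defined in Step 5.
def audio_vis(location, filepath, binsize=2**10, colormap="jet"):
    samplerate, samples = wav.read(location)

    s = fourier_transformation(samples, binsize)
    sshow, freq = make_logscale(s, factor=1.0, sr=samplerate)

    with np.errstate(divide='ignore'):
        ims = 20. * np.log10(np.abs(sshow) / 10e-6)  # amplitude to decibel

    timebins, freqbins = np.shape(ims)

    plt.figure(figsize=(15, 7.5))
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")
    plt.axis('off')
    plt.xlim([0, timebins - 1])
    plt.ylim([0, freqbins])

    plt.savefig(filepath, bbox_inches="tight")
    plt.close()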
for i in range(len(esc50_df.index)):
    
    filename = esc50_df['filename'].iloc[i]
    location = '/content/ESC-50/audio/' + filename
    category = esc50_df['category'].iloc[i]
    catpath = '/content/ESC-50/spectrogram/' + category
    filepath = catpath + '/' + filename[:-4] + '.jpg'

    os.makedirs(catpath, exist_ok=True)
    
    audio_vis(location, filepath)

Step 4: Train-Test-Split

import splitfolders

input_folder = '/content/ESC-50/spectrogram'
output = '/content/data'

# 80/20 split; with a two-value ratio, splitfolders writes train/ and val/ subfolders
splitfolders.ratio(input_folder, output=output, seed=42, ratio=(.8, .2))
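
As a quick sanity check, the resulting folders can be counted before training. This assumes splitfolders produced train and val subfolders (its behavior for a two-value ratio), which is also the directory layout Ultralytics expects for classification:

# Quick sanity check on the split (assumes /content/data now holds train/ and val/ folders)
import os
for split in ('train', 'val'):
    split_dir = os.path.join('/content/data', split)
    print(split, '->', len(os.listdir(split_dir)), 'class folders')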

Step 5: Prepare Validation Data

testing = [
    '/content/data/test/helicopter-fly-over-01.wav'
]
def test_vis(location, filepath, binsize=2**10, colormap="jet"):
    samplerate, samples = wav.read(location)

    s = fourier_transformation(samples, binsize)

    sshow, freq = make_logscale(s, factor=1.0, sr=samplerate)

    with np.errstate(divide='ignore'):
        ims = 20.*np.log10(np.abs(sshow)/10e-6) # amplitude to decibel

    timebins, freqbins = np.shape(ims)

    plt.figure(figsize=(15, 7.5))
    plt.imshow(np.transpose(ims), origin="lower", aspect="auto", cmap=colormap, interpolation="none")

    plt.axis('off')
    plt.xlim([0, timebins-1])
    plt.ylim([0, freqbins])
    
    plt.savefig(filepath, bbox_inches="tight")
    plt.close()

    return
test_vis(testing[0], filepath='/content/data/helicopter-fly-over-01.jpg')

[Output: spectrogram image saved for the helicopter test clip]

Step 6: Classification of Images with YOLO

from ultralytics import YOLO

# load a pretrained YOLOv8 nano classification model
model = YOLO('yolov8n-cls.pt')

Step 7: Model Training

# train the classifier on the spectrogram image folders, then evaluate on the validation split
results = model.train(data='./data', epochs=20, imgsz=640)
metrics = model.val()
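
The validation results can also be read programmatically; this assumes the classification metrics object of recent Ultralytics releases, which exposes top-1 and top-5 accuracy:

# Overall validation accuracy from the metrics object returned by model.val()
print('top-1 accuracy:', metrics.top1)
print('top-5 accuracy:', metrics.top5)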

[Output: training and validation results]

Step 8: Model Prediction

pred = model('/content/data/helicopter-fly-over-01.jpg')
image 1/1 /workspace/yolo-listen/data/helicopter-fly-over-01.jpg: 640x640 thunderstorm 0.98, airplane 0.01, crickets 0.01, frog 0.00, wind 0.00, 2.2ms
Speed: 27.3ms preprocess, 2.2ms inference, 0.1ms postprocess per image at shape (1, 3, 640, 640)
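
The predicted label can also be pulled out of the Results object rather than parsed from the log line; this assumes the probs attribute of recent Ultralytics releases:

# Read the predicted class out of the Results object instead of parsing the log output
res = pred[0]
top1 = int(res.probs.top1)                      # index of the most likely class
print(res.names[top1], float(res.probs.top1conf))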

Conclusion

In this exploration of audio classification with computer vision, we've uncovered the synergy between auditory and visual intelligence. This convergence empowers machines not only to "see" but also to "hear," opening doors to a wide range of applications in diverse fields. We began by introducing the concept and its potential impact in areas such as environmental sound analysis, healthcare, content moderation, and voice command recognition. By marrying the principles of computer vision with audio data, we can understand the world through multiple senses.

Looking at the technology behind audio classification, we saw the pivotal role of deep learning, particularly Convolutional Neural Networks (CNNs). These networks are adapted to process spectrograms, the 2D representations of audio data, and learn the patterns and "fingerprints" of different sounds. This foundational step is essential for any audio classification task.

On the practical side, we covered data preprocessing, train-test splitting, model training, and making predictions. Visualizing audio data through spectrograms provided a unique perspective, helping us recognize patterns and structure within the data.

It is worth noting that while YOLO (You Only Look Once) has its merits in the realm of object detection, it is not a natural fit for audio classification, which relies on sequential and time-dependent features. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) tailored to audio remain the preferred choices, capable of handling the complexity of audio data and its representations.

As we journey into the future of audio classification with computer vision, we can anticipate further breakthroughs and innovations. Machine learning will continue to enrich our ability to interpret and interact with the world, not just through visuals but through the blend of sight and sound. This technology holds promise for a safer, more informed, and interconnected world, redefining how we perceive and understand our surroundings.

Stay connected and support my work through various platforms:

  • GitHub: For all my open-source projects and Notebooks, you can visit my GitHub profile at https://github.com/andysingal. If you find my content valuable, don’t hesitate to leave a star.
  • Patreon: If you’d like to provide additional support, you can consider becoming a patron on my Patreon page at https://www.patreon.com/AndyShanu.
  • Medium: You can read my latest articles and insights on Medium at https://medium.com/@andysingal.
  • Kaggle: Check out my Kaggle profile for data science and machine learning projects at https://www.kaggle.com/alphasingal.
  • Huggingface: For natural language processing and AI-related projects, you can explore my Huggingface profile at https://huggingface.co/Andyrasika.
  • YouTube: To watch my video content, visit my YouTube channel at https://www.youtube.com/@andy111007.
  • LinkedIn: To stay updated on my latest projects and posts, you can follow me on LinkedIn. Here is the link to my profile: https://www.linkedin.com/in/ankushsingal/.
  • Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Remember, each “Like”, “Share”, and “Star” greatly contributes to my work and motivates me to continue producing more quality content. Thank you for your support!
