Detecting the Deceptive: Unmasking Deep Fake Voices

Community Article Published October 29, 2023

Introduction:

In an era where artificial intelligence continues to redefine the boundaries of technology, one of the most intriguing and concerning developments is the emergence of deep fake voices. These uncanny imitations of real human voices are crafted with remarkable precision and have the potential to deceive even the most discerning ears. In this article, we'll delve into the world of Audio Deep Fake Detection, exploring its significance, the challenges it poses, and the strategies employed to combat the rise of deceptive deep fake voices.

Artificial intelligence (AI) has long been a recurring theme in literature and film, and it remains a subject of regular discussion today. In recent years, deepfake technology, an innovation built on AI and deep learning, has become one of its most prominent topics. Deepfake applications have had a considerable public impact, most visibly through manipulated videos that target well-known individuals, but the technology has potential uses in many other domains. One literature-based study set out to explore these uses, concentrating on deep learning and classifying the applications of deepfake technology by analyzing examples from diverse fields. Based on its findings, the significant applications of deepfake technology fall into four groups: arts and entertainment, advertising and marketing, the film industry, and political communication and media.

The Role of Voice in AI:

The human voice is a powerful tool for communication, emotion, and identity. In the realm of AI, the role of voice has expanded dramatically, giving rise to a plethora of voice-related applications:

  1. Voice Assistants: Virtual assistants like Siri, Alexa, and Google Assistant rely on voice recognition technology to understand and respond to user commands.

  2. Text-to-Speech (TTS): AI-driven TTS systems transform written text into natural-sounding speech, enhancing accessibility and enabling natural human-machine interaction.

  3. Voice Authentication: Voice biometrics are used for security and authentication, allowing individuals to unlock devices or access sensitive information with their unique voiceprints.

  4. Audiobooks and Podcasts: AI has made it possible to convert written content into spoken words, expanding the reach and accessibility of literature and information.

Audio Deep Fake Detection: Revealing the Sounds of Deceit

  1. The Challenge of Audio Deepfake: With startling precision, audio deepfake technology can mimic a person's voice and speech pattern. This poses a significant challenge because it's increasingly difficult to distinguish between real and fake audio. Identifying audio deepfakes necessitates a multifaceted strategy that combines knowledge, technology, and alertness.

  2. Data Collection and Preparation: Data is the cornerstone of every deepfake detection algorithm. A diverse dataset containing both real and deepfake audio recordings is essential, and it should represent a wide range of voices, languages, and recording conditions. Preprocessing then extracts informative features from the audio, such as spectrograms or mel-frequency cepstral coefficients (MFCCs). These features serve as the input to machine learning models.

  3. Machine Learning Models: Selecting an appropriate model is a crucial choice in audio deepfake detection. Candidates include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and hybrid architectures. Pre-trained models intended for audio classification can be a good starting point.

  4. Feature Extraction: To distinguish between real and deepfake audio, feature extraction is essential. MFCCs, spectrogram images, or a combination of the two can be used as the model's input features. These features capture the frequency and temporal aspects of the audio, helping the model identify anomalies.

  5. Training and Evaluation: Training is the core of the detection system. The model learns to distinguish the two classes from labeled real and deepfake recordings, and data augmentation is applied to improve its robustness. Performance is assessed with several metrics (for example accuracy, precision, recall, and equal error rate). Cross-validation and testing on unseen data are essential to confirm that the model generalizes.

  6. Optimization and Post-Processing: Fine-tuning improves the model's performance and addresses biases or weaknesses. Post-processing methods refine the model's predictions and lower the number of false positives.

  7. Continuous Monitoring and Real-Time Detection: The final goal is to deploy the model for real-time detection in audio files or streams. The model can function in real-world situations thanks to integration with audio processing frameworks and tools. It takes constant observation and updating to adjust to new deepfake methods.

  8. Ethical Considerations and User Education: For individuals and organizations alike, it's imperative that they are informed about the presence of audio deep fakes. Encouraging the responsible use of audio content and confirming its validity is a shared responsibility. Addressing moral and legal considerations, such as security and privacy concerns, is also critical.

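Items 2 and 4 above describe turning raw audio into time-frequency features. A minimal sketch of that idea, assuming only NumPy, follows: it splits a signal into frames, applies a Hann window, and takes the log-magnitude of each frame's FFT. The function name log_spectrogram, the 25 ms frame and 10 ms hop at 16 kHz, and the synthetic 440 Hz tone are illustrative assumptions rather than part of the article's pipeline; libraries such as librosa offer MFCC and mel-spectrogram routines for real use.

```python
import numpy as np


def log_spectrogram(signal, frame_length=400, hop_length=160):
    """Return a (n_frames, frame_length // 2 + 1) log-magnitude spectrogram."""
    signal = np.asarray(signal, dtype=np.float64)
    if signal.shape[0] < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (signal.shape[0] - frame_length) // hop_length
    window = np.hanning(frame_length)
    # Slice overlapping frames and taper each one to reduce spectral leakage.
    frames = np.stack([
        signal[i * hop_length: i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # A small constant avoids log(0) for silent bins.
    return np.log(magnitude + 1e-10)


# Synthetic one-second 440 Hz tone (a stand-in for a real recording).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)

features = log_spectrogram(tone)
# Each FFT bin spans sample_rate / frame_length = 40 Hz.
peak_bin = int(features.mean(axis=0).argmax())
print(features.shape, peak_bin * sample_rate / 400)
```

Each row of the resulting matrix describes one short slice of audio, which is the kind of input a CNN or RNN classifier could consume.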

[Figure; source: Deepfake]

Code Implementation

In this section, we walk through loading and exploring the Deepfake Detection Challenge dataset from Kaggle, which serves as the foundation for a deepfake detection project. The dataset is a rich resource of manipulated and unaltered videos, an essential component for training and evaluating deepfake detection models; the audio tracks are extracted from these videos in the final step.

Step 1: Import libraries

import numpy as np
import pandas as pd
import os
import matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
%matplotlib inline 
import cv2 as cv

from pathlib import Path
import subprocess
import librosa.display
import librosa.filters

DATA_FOLDER = '/kaggle/input/deepfake-detection-challenge'
TRAIN_SAMPLE_FOLDER = 'train_sample_videos'
TEST_FOLDER = 'test_videos'
INPUT_PATH = '../input/realfake045/all/all'
WAV_PATH = './wavs/'
print(f"Train samples: {len(os.listdir(os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER)))}")
print(f"Test samples: {len(os.listdir(os.path.join(DATA_FOLDER, TEST_FOLDER)))}")

This code sets up variables for the paths to data files in a deepfake detection challenge. It defines "DATA_FOLDER" as the main data directory, "TRAIN_SAMPLE_FOLDER" as the folder containing labeled training videos, and "TEST_FOLDER" as the folder for testing videos. It uses the "os" module to count the files in these folders. The code utilizes f-strings to print the sample counts for training and testing data. This code is a helpful step in data exploration for a deepfake detection challenge, allowing easy assessment of data sample sizes.

Step 2: Check file types

Here we check the file extensions in the training data. Most of the files appear to have the mp4 extension; let's check whether there are other extensions as well.

train_list = list(os.listdir(os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER)))
ext_dict = []
for file in train_list:
    file_ext = os.path.splitext(file)[1].lstrip('.')  # robust to names containing extra dots
    if (file_ext not in ext_dict):
        ext_dict.append(file_ext)
print(f"Extensions: {ext_dict}")

Output:

Extensions: ['mp4', 'json']

Let's count how many files there are with each extension.

for file_ext in ext_dict:
    print(f"Files with extension `{file_ext}`: {len([file for file in train_list if  file.endswith(file_ext)])}")

Output:

Files with extension `mp4`: 400
Files with extension `json`: 1
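The two loops above can also be condensed into a single pass. The following is a small sketch using only the standard library; the helper name count_extensions and the example file names are made up for illustration. It uses os.path.splitext, which keeps working when a file name contains more than one dot.

```python
import os
from collections import Counter


def count_extensions(filenames):
    """Count files per extension (lowercased, without the leading dot)."""
    return Counter(
        os.path.splitext(name)[1].lstrip(".").lower() for name in filenames
    )


# Hypothetical listing; in the notebook this would be train_list.
example_files = ["aagfhgtpmv.mp4", "abarnvbtwb.mp4", "metadata.json", "clip.v2.mp4"]
extension_counts = count_extensions(example_files)
print(extension_counts)
```

Note that "clip.v2.mp4" is counted as mp4 here, whereas splitting on the first dot would report "v2".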

Let's repeat the same process for the test videos folder.

test_list = list(os.listdir(os.path.join(DATA_FOLDER, TEST_FOLDER)))
ext_dict = []
for file in test_list:
    file_ext = os.path.splitext(file)[1].lstrip('.')  # robust to names containing extra dots
    if (file_ext not in ext_dict):
        ext_dict.append(file_ext)
print(f"Extensions: {ext_dict}")
for file_ext in ext_dict:
    # count the test files that have this extension
    print(f"Files with extension `{file_ext}`: {len([file for file in test_list if  file.endswith(file_ext)])}")

Let's check the JSON file.

json_file = [file for file in train_list if  file.endswith('json')][0]
print(f"JSON file: {json_file}")

This code snippet searches for a file in the train_list that ends with the extension .json and assigns it to the variable json_file.

Let's explore this JSON file.

def get_meta_from_json(path):
    df = pd.read_json(os.path.join(DATA_FOLDER, path, json_file))
    df = df.T
    return df

meta_train_df = get_meta_from_json(TRAIN_SAMPLE_FOLDER)
meta_train_df.head()

Output

                label  split  original
aagfhgtpmv.mp4  FAKE   train  vudstovrck.mp4
aapnvogymq.mp4  FAKE   train  jdubbvfswz.mp4
abarnvbtwb.mp4  REAL   train  None
abofeumbvv.mp4  FAKE   train  atvmxvwyns.mp4
abqwwspghj.mp4  FAKE   train  qzimuostzz.mp4
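To see why get_meta_from_json transposes the frame, it helps to look at a tiny stand-in for the metadata file. The sketch below assumes the JSON maps each file name to a record with label, split, and original fields; this layout is inferred from the output above, not confirmed by the article, and the two entries are copied from that output.

```python
import json

import pandas as pd

# Assumed layout of the metadata file, inferred from the table above.
raw_json = """
{
  "aagfhgtpmv.mp4": {"label": "FAKE", "split": "train", "original": "vudstovrck.mp4"},
  "abarnvbtwb.mp4": {"label": "REAL", "split": "train", "original": null}
}
"""

meta = json.loads(raw_json)
# A dict of dicts becomes one column per file name, so transpose
# to get one row per video instead.
meta_df = pd.DataFrame(meta).T
print(meta_df)
```

After the transpose, each video is a row and label, split, and original are columns, which matches the shape used in the rest of the exploration.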

Step 3: Metadata exploration

Let's now explore the metadata of the training sample.

Missing data

We start by checking for any missing values.

def missing_data(data):
    total = data.isnull().sum()
    percent = (data.isnull().sum()/data.isnull().count()*100)
    tt = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    types = []
    for col in data.columns:
        dtype = str(data[col].dtype)
        types.append(dtype)
    tt['Types'] = types
    return(np.transpose(tt))

This code defines a function missing_data(data) that takes a pandas DataFrame object data as input and returns a summary of the missing data in the DataFrame.

missing_data(meta_train_df)

Output

         label   split   original
Total    0       0       77
Percent  0       0       19.25
Types    object  object  object

This code is calling the missing_data() function and passing the meta_train_df DataFrame as an argument.

The "original" field is missing for 77 samples (19.25%). We suspect that these are the REAL videos, which have no source video to point to (generalizing from the rows we glimpsed). Let's check this hypothesis.

missing_data(meta_train_df.loc[meta_train_df.label=='REAL'])

This code is calling the missing_data() function on a subset of the meta_train_df DataFrame that meets a specific condition, using the .loc method to select rows based on the value of the label column.
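The same hypothesis can be checked in one expression by computing the share of missing "original" values per label. This is a sketch on a toy DataFrame whose rows are invented for illustration; in the notebook, meta_train_df would take the place of toy_meta.

```python
import pandas as pd

# Toy metadata (invented rows) with the same columns as meta_train_df.
toy_meta = pd.DataFrame(
    {
        "label": ["FAKE", "FAKE", "REAL", "FAKE", "REAL"],
        "original": ["a.mp4", "b.mp4", None, "a.mp4", None],
    },
    index=["v1.mp4", "v2.mp4", "v3.mp4", "v4.mp4", "v5.mp4"],
)

# Fraction of rows with a missing original, computed separately per label.
missing_by_label = toy_meta["original"].isna().groupby(toy_meta["label"]).mean()
print(missing_by_label)

# True when every row with a missing original is labeled REAL.
only_real_missing = bool(
    toy_meta.loc[toy_meta["original"].isna(), "label"].eq("REAL").all()
)
print(only_real_missing)
```

If the hypothesis holds on the real data, the REAL group shows a missing share of 1.0 and the FAKE group shows 0.0.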

Step 4: Unique values

def unique_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    uniques = []
    for col in data.columns:
        unique = data[col].nunique()
        uniques.append(unique)
    tt['Uniques'] = uniques
    return(np.transpose(tt))

This code defines a function unique_values(data) that takes a pandas DataFrame and returns, for each column, the number of non-null values and the number of unique values.

unique_values(meta_train_df)

This code is calling the unique_values() function and passing the meta_train_df DataFrame as an argument.

Step 5: Most frequent originals

def most_frequent_values(data):
    total = data.count()
    tt = pd.DataFrame(total)
    tt.columns = ['Total']
    items = []
    vals = []
    for col in data.columns:
        itm = data[col].value_counts().index[0]
        val = data[col].value_counts().values[0]
        items.append(itm)
        vals.append(val)
    tt['Most frequent item'] = items
    tt['Frequence'] = vals
    tt['Percent from total'] = np.round(vals / total * 100, 3)
    return(np.transpose(tt))
most_frequent_values(meta_train_df)

The call most_frequent_values(meta_train_df) applies the function to the metadata DataFrame. For each column it reports the number of non-null values, the most frequent item, how often that item occurs, and its share of the total as a percentage.
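For non-numeric columns, pandas can produce a comparable summary directly: DataFrame.describe reports count, unique, top (the most frequent item), and freq (how often it occurs). The sketch below uses a toy DataFrame with invented rows; it is an alternative view of the same statistics, not the article's own helper.

```python
import pandas as pd

# Toy metadata (invented rows) standing in for meta_train_df.
toy_meta = pd.DataFrame(
    {
        "label": ["FAKE", "FAKE", "REAL", "FAKE"],
        "original": ["a.mp4", "a.mp4", None, "b.mp4"],
    }
)

# include="all" summarizes every column regardless of dtype.
summary = toy_meta.describe(include="all")
print(summary)
```

The count row excludes missing values, so the "original" column reports 3 rather than 4 here.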

Step 6: Data distribution visualizations

def plot_count(feature, title, df, size=1):
    '''
    Plot count of classes / feature
    param: feature - the feature to analyze
    param: title - title to add to the graph
    param: df - dataframe from which we plot feature's classes distribution 
    param: size - default 1.
    '''
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
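    # pass the column as x= (recent seaborn versions no longer accept it positionally) and draw on ax
    g = sns.countplot(x=df[feature], order=df[feature].value_counts().index[:20], palette='Set3', ax=ax)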
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center") 
    plt.show()    
plot_count('split', 'split (train)', meta_train_df)

[Figure: bar chart showing the number and percentage of samples in each split]
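The percentages that plot_count writes above each bar can also be computed without drawing anything, which is handy for a quick class-balance check. This sketch uses an invented label column with eight FAKE and two REAL entries; the numbers are illustrative and do not describe the real dataset.

```python
import pandas as pd

# Invented labels for illustration only.
labels = pd.Series(["FAKE"] * 8 + ["REAL"] * 2, name="label")

# normalize=True returns fractions instead of raw counts.
percentages = labels.value_counts(normalize=True).mul(100).round(2)
print(percentages)
```

A strongly skewed result is a hint that accuracy alone may be a misleading metric for the detector.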

Step 7: Video data exploration

Next, we explore some of the video data.

Missing video (or meta) data

We first check whether the list of files in the metadata matches the list of files in the folder.

meta = np.array(list(meta_train_df.index))
storage = np.array([file for file in train_list if  file.endswith('mp4')])
print(f"Metadata: {meta.shape[0]}, Folder: {storage.shape[0]}")
print(f"Files in metadata and not in folder: {np.setdiff1d(meta,storage,assume_unique=False).shape[0]}")
print(f"Files in folder and not in metadata: {np.setdiff1d(storage,meta,assume_unique=False).shape[0]}")

Output

Metadata: 400, Folder: 400
Files in metadata and not in folder: 0
Files in folder and not in metadata: 0

A few fake videos

fake_train_sample_video = list(meta_train_df.loc[meta_train_df.label=='FAKE'].sample(3).index)
fake_train_sample_video

Output

['bguwlyazau.mp4', 'byfenovjnf.mp4', 'dsndhujjjb.mp4']

Modifying a function for displaying a selected image from a video

def display_image_from_video(video_path):
    '''
    input: video_path - path for video
    process:
    1. perform a video capture from the video
    2. read the image
    3. display the image
    '''
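    capture_image = cv.VideoCapture(video_path)
    ret, frame = capture_image.read()
    capture_image.release()  # free the file handle once the first frame is read
    if not ret:
        raise ValueError(f"Could not read a frame from {video_path}")
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111)
    frame = cv.cvtColor(frame, cv.COLOR_BGR2RGB)  # OpenCV loads BGR; matplotlib expects RGB
    ax.imshow(frame)
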
for video_file in fake_train_sample_video:
    display_image_from_video(os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER, video_file))

Output: [Figure: first frame of each of the three sampled fake videos]

Now let's do the same for a few of the real videos.

real_train_sample_video = list(meta_train_df.loc[meta_train_df.label=='REAL'].sample(3).index)
real_train_sample_video

Output

['ciyoudyhly.mp4', 'ekcrtigpab.mp4', 'cfxkpiweqt.mp4']

for video_file in real_train_sample_video:
    display_image_from_video(os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER, video_file))

[Figure: first frame of each of the three sampled real videos]

Step 8: Videos with same original

meta_train_df['original'].value_counts()[0:5]

Output:

meawmsgiti.mp4    6
atvmxvwyns.mp4    6
qeumxirsme.mp4    5
kgbkktcjxf.mp4    5
qzklcjjxdq.mp4    4
Name: original, dtype: int64

Let's modify our visualization function to work with multiple images.

def display_image_from_video_list(video_path_list, video_folder=TRAIN_SAMPLE_FOLDER):
    '''
    input: video_path_list - path for video
    process:
    0. for each video in the video path list
        1. perform a video capture from the video
        2. read the image
        3. display the image
    '''
    # plt.subplots creates the figure itself, so no separate plt.figure() call is needed
    fig, ax = plt.subplots(2,3,figsize=(16,8))
    # we only show images extracted from the first 6 videos
    for i, video_file in enumerate(video_path_list[0:6]):
        video_path = os.path.join(DATA_FOLDER, video_folder,video_file)
        capture_image = cv.VideoCapture(video_path) 
        ret, frame = capture_image.read()
        frame = cv.cvtColor(frame, cv.COLOR_BGR2RGB)
        ax[i//3, i%3].imshow(frame)
        ax[i//3, i%3].set_title(f"Video: {video_file}")
        ax[i//3, i%3].axis('on')
same_original_fake_train_sample_video = list(meta_train_df.loc[meta_train_df.original=='meawmsgiti.mp4'].index)
display_image_from_video_list(same_original_fake_train_sample_video)

[Figure: first frames of the fake videos derived from meawmsgiti.mp4]

This code displays the first frame of up to six fake videos from the training sample that were generated from the same original video, "meawmsgiti.mp4". Viewing them side by side is useful for comparing the quality and characteristics of fakes derived from a single source.
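Rather than hard-coding one original file name, the fakes can be grouped by their source in a single step. The following sketch works on a toy DataFrame with invented file names; with the real data, meta_train_df would replace toy_meta.

```python
import pandas as pd

# Toy metadata (invented file names) with the same columns as meta_train_df.
toy_meta = pd.DataFrame(
    {
        "label": ["FAKE", "FAKE", "REAL", "FAKE"],
        "original": ["src1.mp4", "src1.mp4", None, "src2.mp4"],
    },
    index=["f1.mp4", "f2.mp4", "src1.mp4", "f3.mp4"],
)

fakes = toy_meta[toy_meta["label"] == "FAKE"]
# Map each original video to the list of fakes generated from it.
fakes_by_original = {
    original: list(group.index) for original, group in fakes.groupby("original")
}
print(fakes_by_original)
```

Each list in the dictionary could then be passed to display_image_from_video_list to inspect one family of fakes at a time.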

Step 9: Test video files

Let's also look at a few of the test data files.

test_videos = pd.DataFrame(list(os.listdir(os.path.join(DATA_FOLDER, TEST_FOLDER))), columns=['video'])
test_videos.head()

Now let's visualize one of the videos.

display_image_from_video(os.path.join(DATA_FOLDER, TEST_FOLDER, test_videos.iloc[0].video))

[Figure: first frame of the first test video]

The purpose of the "display_image_from_video" function is to display the first frame of the specified video file as an image. Therefore, the overall purpose of the code is to display the first frame of the first video file in the "test" folder of the data directory, allowing for easy inspection of the content and quality of the video.

Step 10: Play video files

fake_videos = list(meta_train_df.loc[meta_train_df.label=='FAKE'].index)
from IPython.display import HTML
from base64 import b64encode

def play_video(video_file, subset=TRAIN_SAMPLE_FOLDER):
    '''
    Display video
    param: video_file - the name of the video file to display
    param: subset - the folder where the video file is located (can be TRAIN_SAMPLE_FOLDER or TEST_FOLDER)
    '''
    # read the raw bytes; the with-block closes the file afterwards
    with open(os.path.join(DATA_FOLDER, subset, video_file), 'rb') as f:
        video_bytes = f.read()
    # embed the video in the page as a base64 data URL
    data_url = "data:video/mp4;base64," + b64encode(video_bytes).decode()
    return HTML("""<video width=500 controls><source src="%s" type="video/mp4"></video>""" % data_url)
play_video(fake_videos[0])
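The key step in play_video is packing the file's bytes into a base64 data URL that the browser can play inline. The sketch below isolates that step with only the standard library; the helper name to_data_url and the few placeholder bytes are assumptions for illustration, not a real video.

```python
from base64 import b64decode, b64encode


def to_data_url(raw_bytes, mime_type="video/mp4"):
    """Wrap raw bytes in a data URL that can be placed in a src attribute."""
    return f"data:{mime_type};base64," + b64encode(raw_bytes).decode("ascii")


# Placeholder bytes standing in for the contents of an mp4 file.
payload = b"\x00\x00\x00\x18ftypmp42"
data_url = to_data_url(payload)
print(data_url)

# Decoding the part after the comma recovers the original bytes.
recovered = b64decode(data_url.split(",", 1)[1])
```

Base64 encoding makes the payload roughly one third larger, so this approach suits short clips better than long videos.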

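Step 11: Download the public datasets

The public datasets used in this step are available at https://www.kaggle.com/rakibilly/ffmpeg-static-build and https://www.kaggle.com/datasets/phoenix9032/realfake045

# Unpack the static ffmpeg build (notebook shell command)
!tar xvf /kaggle/input/ffmpeg-static-build/ffmpeg-git-amd64-static.tar.xz

output_format = 'wav'  # other audio formats such as aac are also possible

output_dir = Path(f"{output_format}s")
output_dir.mkdir(exist_ok=True, parents=True)
fake_name = 'aaeflzzhvy'
real_name = 'flqgmnetsg'

# Collect the full path of every video in the training sample folder
list_of_files = []
for file in os.listdir(os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER)):
    if not file.endswith('.mp4'):
        continue  # skip the metadata JSON file
    filename = os.path.join(DATA_FOLDER, TRAIN_SAMPLE_FOLDER, file)  # join adds the path separator
    list_of_files.append(filename)

# Note: create_wav is not defined in this article; it is assumed to convert each listed video to audio with ffmpeg.
# %%time is a cell magic, so it must be the first line of its own notebook cell.
%%time
create_wav(list_of_files)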
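The article calls create_wav without showing its definition. One way such a helper could look is sketched below, assuming an ffmpeg executable is reachable through the ffmpeg_bin argument. The names build_wav_command and ffmpeg_bin, the mono 16 kHz settings, and the return value are all assumptions made for illustration, not the author's implementation.

```python
import subprocess
from pathlib import Path


def build_wav_command(video_path, output_dir, ffmpeg_bin="ffmpeg", sample_rate=16000):
    """Build the ffmpeg argument list that extracts a mono WAV track from one video."""
    wav_path = Path(output_dir) / (Path(video_path).stem + ".wav")
    return [
        ffmpeg_bin,
        "-y",                     # overwrite an existing output file
        "-i", str(video_path),    # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mix down to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        str(wav_path),
    ]


def create_wav(list_of_files, output_dir="wavs", ffmpeg_bin="ffmpeg"):
    """Run ffmpeg once per video and return the paths that failed to convert."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for video_path in list_of_files:
        command = build_wav_command(video_path, output_dir, ffmpeg_bin)
        result = subprocess.run(command, capture_output=True)
        if result.returncode != 0:
            failed.append(video_path)
    return failed


# Show the command for one of the file names mentioned above, without running it.
example_command = build_wav_command("train_sample_videos/aaeflzzhvy.mp4", "wavs")
print(" ".join(example_command))
```

In the Kaggle notebook, ffmpeg_bin would need to point at the binary unpacked from the static build archive; passing the arguments as a list avoids shell quoting problems with unusual file names.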

Conclusion

In conclusion, "Detecting the Deceptive: Unmasking Deep Fake Voices" sheds light on the ever-evolving realm of audio deep fake technology. As the digital era progresses, the ability to manipulate audio recordings with unprecedented realism has raised significant concerns, including misinformation, privacy breaches, and cybersecurity risks.

This article has delved into the intricate landscape of audio deep fake detection, elucidating the challenges faced in this domain. From the intricate process of data collection and arrangement to the utilization of various machine learning models, feature extraction techniques, and robust training procedures, the methodologies behind unmasking deep fake voices are diverse and demanding.

Furthermore, the critical phase of model optimization and post-processing improves performance while addressing biases and weaknesses. Achieving real-time detection in audio streams and files is the ultimate goal, and it requires continuous monitoring and updates to keep pace with new deepfake methods.

Not only is the article a technical exploration, but it also emphasizes the ethical considerations surrounding the responsible use of audio content. It underscores the collective responsibility to safeguard the integrity of audio information and addresses the moral and legal dimensions, including security and privacy.

In a world increasingly shaped by artificial intelligence, understanding and countering the rise of deceptive deep fake voices is a paramount endeavor. With vigilance, innovation, and a commitment to ethical principles, we can strive to preserve the authenticity of audio in an era of technological marvels and deceptions.

Requests and questions: If you have a project in mind that you’d like me to work on or if you have any questions about the concepts I’ve explained, don’t hesitate to let me know. I’m always looking for new ideas for future Notebooks and I love helping to resolve any doubts you might have.

Remember, each “Like”, “Share”, and “Star” greatly contributes to my work and motivates me to continue producing more quality content. Thank you for your support!
