
Detect Deepfakes and Hacked Audio Programmatically

Deepfakes are a type of synthetic media in which part of an image or video, most often a person's likeness, is replaced with something else. The act of creating deepfakes is known as deepfakery. Deepfakes can be used to create fake news, spread misinformation, or damage someone's reputation. They can also be used by a service like JibJab just to be funny, although those are obviously cartoonish and not meant to actually fool anyone.

There are a number of different approaches to detecting deepfakes. Some of these approaches are based on image forensics, which is the study of digital images to identify and extract hidden information. Other approaches are based on machine learning, which is the use of algorithms to learn from data and make predictions. We’ll focus on using some pre-built algorithms when we get to code later.

Types of Deepfake Analysis

Image forensics approaches to deepfake detection identify inconsistencies in the image or video that are caused by the deepfakery process. For example, deepfakes often contain artifacts that are not present in real images or videos. These artifacts can be used to identify deepfakes by trained human experts or by machine learning algorithms. Machine learning approaches use datasets of real and fake images or videos and improve their detection as they get more data (for example, if users can upload known content that has been altered). Once trained, the models can be used to predict whether a new image or video is real or fake.

The accuracy of deepfake detection methods varies depending on the method and the quality of the deepfake, as well as how much data there is to train the models. Some methods are more effective at detecting low-quality deepfakes, while others are more effective at detecting high-quality deepfakes. As deepfake technology continues to improve, it is likely that deepfake detection methods will also need to improve. Researchers are working on developing new and more effective methods for detecting deepfakes, which might show up as features in a framework like TensorFlow or might manifest as scripts that handle some of the necessary tasks. The most common approaches to detection include:

  • Temporal consistency analysis. This approach looks for inconsistencies in the temporal structure of the video, such as unnatural blinking or lip movements (see the sketch after this list).
  • Spatial consistency analysis. This approach looks for inconsistencies in the spatial structure of the video, such as unnatural skin textures or lighting.
  • Audio analysis. This approach looks for inconsistencies in the audio track, such as unnatural voice pitch or intonation.
  • Machine learning. This approach uses purely statistical models, often a combination of the above options, to learn to distinguish between real and fake videos.
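As a rough sketch of the temporal consistency idea, the short script below flags a clip whose frame-to-frame change spikes far above its own average. It uses OpenCV (an assumption; the rest of this post sticks to librosa and TensorFlow), the file path and the four-sigma cutoff are placeholders, and a real detector would track face landmarks rather than raw pixel differences:

import cv2
import numpy as np

def frame_difference_scores(video_file):

  """Mean absolute pixel change between consecutive grayscale frames."""

  cap = cv2.VideoCapture(video_file)
  scores, prev = [], None
  while True:
    ok, frame = cap.read()
    if not ok:
      break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev is not None:
      scores.append(float(np.mean(cv2.absdiff(gray, prev))))
    prev = gray
  cap.release()
  return np.array(scores)

# Flag the clip if any jump sits far outside the clip's own norm.
scores = frame_difference_scores("suspect.mp4")
suspicious = bool(np.any(scores > scores.mean() + 4 * scores.std()))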

It is important to note that no single detection method is perfect. Deepfakes can be very difficult to detect, especially if they are high quality. However, by using a combination of different detection methods, it is possible to increase the chances of detecting deepfakes. So for example, if a machine learning approach can detect 80% with a high probability of success, then a human can go through the other 20% and improve the models by adding positive detections to one dataset and negative detections to another; or, to put it more succinctly, train the model.

For most regular humans, though, we just have to be skeptical of any video or image online, especially if it seems too good to be true.

Detecting AI-generated Audio

Now, let’s look at some programmatic approaches to detection. The first step is to load the audio file and compute the spectral features of the audio; that is, to translate it into numbers that a statistical model can analyze. The spectral features are a set of numbers that describe the frequency content of the audio. The script below then converts the spectral features to a tensor. A tensor is a multi-dimensional array; it is a generalization of a vector and a matrix.

Tensors can be used to represent a wide variety of data, including images, videos, and sound. The rank of a tensor is the number of dimensions that it has. A scalar is a rank-0 tensor, a vector is a rank-1 tensor, and a matrix is a rank-2 tensor. The shape of a tensor is the number of elements in each dimension. For example, a rank-2 tensor with a shape of (3, 4) has 12 elements. The components of a tensor are the individual elements that make up the tensor, indexed by their position. For example, using one-based indexing, the component of a rank-2 tensor with a shape of (3, 4) at position (1, 2) is the element in the first row and second column.
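Those terms map directly onto TensorFlow primitives:

import tensorflow as tf

scalar = tf.constant(3.0)              # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])  # rank 1, shape (3,)
matrix = tf.zeros((3, 4))              # rank 2, shape (3, 4)

print(tf.rank(matrix))  # 2
print(matrix.shape)     # (3, 4)
print(tf.size(matrix))  # 12 elements in total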

Tensors are used in machine learning and artificial intelligence to represent data and to perform mathematical operations on it, such as classification, regression, and clustering. That requires a model to compare one big vector or matrix to another, where samples that look like something already known (as in K-Nearest Neighbors) can be classified or clustered together by similarity. Thus, in a simplistic script, we can have a variable called model and load the AI voice detection model into it. The tf.keras.models.load_model function loads a model that was previously saved via model.save(). We can also use that to continue training the dataset (e.g., there's a button for fake or not fake that then tags samples as each). Finally, once the model is loaded, the script does the easy part: it predicts whether the audio sample was generated by AI.

The model is trained on a dataset of audio samples that were generated by AI and a dataset of audio samples that were not. The script returns a prediction: if the prediction is greater than 0.5, the model believes the audio sample was generated by AI; if it is less than 0.5, the model believes it was not.

The Model

The AI voice detection model is not perfect (in fact, I don’t think there’s ever been a perfect machine learning model, but this one can get us part of the way there). Its accuracy depends on a number of factors, including the quality of the audio sample, the quality of the model itself, and the amount of training data it was trained on. But in the following script, we’ll use a combination of librosa, numpy, and tensorflow to get a good start:

import librosa
import numpy as np
import tensorflow as tf

def detect_ai_generated_voice(audio_file):

  """Detects if an audio sample was generated by AI.
  Args:
    audio_file: The path to the audio file to be analyzed.
  Returns:
    True if the audio sample was generated by AI, False otherwise.
  """

  # Load the audio file.
  audio, sr = librosa.load(audio_file)

  # Compute the spectral features (MFCCs) of the audio. Newer versions
  # of librosa require keyword arguments here.
  spectral_features = librosa.feature.mfcc(y=audio, sr=sr)

  # Convert the spectral features to a tensor, adding a batch dimension.
  # The exact input shape expected depends on how the model was trained.
  spectral_features_tensor = tf.convert_to_tensor(spectral_features[np.newaxis, ...])

  # Load the AI voice detection model.
  model = tf.keras.models.load_model("ai_voice_detection_model.h5")

  # Predict whether the audio sample was generated by AI. This assumes
  # the model outputs a single probability per sample.
  prediction = model.predict(spectral_features_tensor)

  # Treat anything above 0.5 as AI-generated.
  return bool(prediction[0][0] > 0.5)
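A quick usage example, with a placeholder file name:

if detect_ai_generated_voice("sample.wav"):
  print("This clip was likely generated by AI.")
else:
  print("This clip looks human.")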

Detecting Doctored Content

Another thing to look for is an audio track that was generated by a human but has been doctored (be it by a script or some other means). Most of us who make audio use a number of filters, but the last one I usually run levels out the audio. For example, I’ve found most people get quieter the longer they talk, so we turn up the volume gradually to accommodate for this. That would change the results of a spectral analysis versus the original, unedited clip.

import librosa
import numpy as np

def detect_doctored_voice(audio_file):

  """Detects if a voice sample was doctored.
  Args:
    audio_file: The path to the audio file to be analyzed.
  Returns:
    True if the voice sample was doctored, False otherwise.
  """

  # Load the audio file.
  audio, sr = librosa.load(audio_file)

  # Compute the spectral features (MFCCs) of the audio.
  spectral_features = librosa.feature.mfcc(y=audio, sr=sr)

  # Compute the mean and standard deviation of each coefficient over time.
  mean = np.mean(spectral_features, axis=1, keepdims=True)
  std = np.std(spectral_features, axis=1, keepdims=True)

  # Measure how far each frame sits from the clip's own average,
  # in standard deviations.
  distances = np.linalg.norm((spectral_features - mean) / std, axis=0)

  # If any frame is farther than the threshold, flag the sample as doctored.
  # The threshold is a tunable placeholder; calibrate it on known-clean audio.
  threshold = 10.0
  return bool(np.any(distances > threshold))

The above script works by first loading the audio file and computing the spectral features of the audio. The spectral features are a set of numbers that describe the frequency content of the audio. The script then computes the mean and standard deviation of those features over time and measures how far each frame sits from the clip’s own average. If any frame’s distance is greater than a threshold, the voice sample is flagged as doctored. The threshold value can be adjusted to make the script more or less sensitive to doctored voice samples: a lower threshold will make the script more sensitive, while a higher threshold will make it less sensitive.
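One hedged way to pick that threshold is to measure the same distances on clips you know are unedited and set the threshold just above the largest value you observe. The file paths and the 10% margin below are placeholders:

import librosa
import numpy as np

def max_frame_distance(audio_file):

  """Largest per-frame distance, computed the same way as detect_doctored_voice."""

  audio, sr = librosa.load(audio_file)
  mfcc = librosa.feature.mfcc(y=audio, sr=sr)
  mean = np.mean(mfcc, axis=1, keepdims=True)
  std = np.std(mfcc, axis=1, keepdims=True)
  return float(np.max(np.linalg.norm((mfcc - mean) / std, axis=0)))

# Hypothetical known-clean clips; set the threshold slightly above their maximum.
clean_clips = ["clean1.wav", "clean2.wav"]
threshold = 1.1 * max(max_frame_distance(f) for f in clean_clips)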

Notice that in the above scripts, we’re mixing and matching the concept of machine learning and deep learning. The main difference between machine learning and deep learning is the type of algorithm that is used. Machine learning algorithms can be simple or complex, but they are all based on statistical methods. Deep learning algorithms, on the other hand, are based on artificial neural networks.

Another difference between machine learning and deep learning is the amount of data required to train the algorithm. Machine learning algorithms can be trained on relatively small datasets, but deep learning algorithms require much larger ones.

Finally, machine learning algorithms are typically used for tasks that are relatively simple, such as classification and regression. Deep learning algorithms, on the other hand, are typically used for tasks that are more complex, such as image recognition and natural language processing.

If you are trying to accomplish a simple task, such as classification or regression, then machine learning may be a good option. If you are trying to accomplish a more complex task, such as image recognition or natural language processing, then deep learning may be a better option. It is also important to consider the amount of data that you have available. If you have a relatively small dataset, then machine learning may be a better option. If you have a large dataset, then deep learning may be a better option. Finally, consider the cost of training the algorithm. Machine learning algorithms can be trained on relatively inexpensive hardware, while deep learning algorithms require more expensive hardware (or virtual instances if leveraging hosted services).

Hyperparameters

We showcased two, but there are a number of other machine learning models that can be used to detect deepfakes. These models are trained on a dataset of real and fake images or videos. The models learn to identify the features that are characteristic of deepfakes, such as unnatural facial expressions, lighting inconsistencies, and artifacts from the editing process. The performance of a deepfake detection model can be improved by tuning the hyperparameters of the model. Hyperparameters are the settings that control the learning process.

The best hyperparameters for deepfake detection will vary depending on the model and the dataset. However, there are some general guidelines that can be followed:

  • Learning rate: The learning rate controls how quickly the model learns. A higher learning rate will cause the model to learn more quickly, but it may also overshoot good solutions and destabilize training. A lower learning rate will cause the model to learn more slowly, but training is more likely to converge.
  • Number of epochs: The number of epochs controls how many times the model is trained on the data. A higher number of epochs will cause the model to learn more, but it may also cause the model to overfit the data. A lower number of epochs will cause the model to learn less, but it may be less likely to overfit the data.
  • Batch size: The batch size controls how much data is used to train the model at a time. A larger batch size will cause the model to learn more efficiently, but it may also require more memory. A smaller batch size will require less memory, but it may take longer to train the model.

It is important to experiment with different hyperparameters to find the best combination for your model and dataset (this is one of the reasons it can take a solid year to get a more accurate model). Techniques like grid search automate this process. Grid search involves creating a grid of different hyperparameter combinations and evaluating the performance of each combination on a holdout set of data. The combination with the best performance can then be used to train the final model.
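Here is a minimal sketch of that grid search. The grid values, the tiny classifier, and the flattened input width are illustrative assumptions rather than recommended settings:

import itertools
import tensorflow as tf

# Hypothetical grid; the values are illustrative, not tuned.
param_grid = {
  "learning_rate": [1e-4, 1e-3, 1e-2],
  "epochs": [10, 20],
  "batch_size": [16, 32],
}

def build_model(learning_rate):
  # A small placeholder binary classifier over flattened features.
  model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2600,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                loss="binary_crossentropy", metrics=["accuracy"])
  return model

def grid_search(x_train, y_train, x_val, y_val):
  # Train one model per combination and keep the best holdout accuracy.
  best_score, best_params = 0.0, None
  keys = list(param_grid)
  for values in itertools.product(*(param_grid[k] for k in keys)):
    params = dict(zip(keys, values))
    model = build_model(params["learning_rate"])
    model.fit(x_train, y_train, epochs=params["epochs"],
              batch_size=params["batch_size"], verbose=0)
    _, accuracy = model.evaluate(x_val, y_val, verbose=0)
    if accuracy > best_score:
      best_score, best_params = accuracy, params
  return best_params, best_score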

In addition to tuning the hyperparameters, there are a number of other things that can be done to improve the performance of a deepfake detection model. Using a larger and more diverse dataset gives the model more information to learn from, and a more diverse dataset helps the model generalize to new types of deepfakes. A more powerful model will be able to learn more complex features. A regularization technique helps prevent the model from overfitting the data. Switching any of these can require restructuring the data.
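Of those options, regularization is the easiest to show in code. Here is a hedged sketch of two common techniques in Keras, L2 weight penalties and dropout; the layer sizes, input width, and rates are illustrative assumptions, not tuned values:

import tensorflow as tf

model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(2600,)),  # illustrative flattened-feature width
  # L2 regularization penalizes large weights in this layer.
  tf.keras.layers.Dense(64, activation="relu",
                        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
  # Dropout randomly zeroes 30% of activations during training.
  tf.keras.layers.Dropout(0.3),
  tf.keras.layers.Dense(1, activation="sigmoid"),
])

Part of tweaking that dataset is editing the model itself, which can again be done automatically, but also by hand.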

Editing TensorFlow Models

TensorFlow models are typically saved in the TensorFlow SavedModel format, a binary format that stores the model’s architecture, weights, and other parameters. Rather than editing those files directly, the usual approach is to edit the model through the TensorFlow API, which provides functions for adding layers, removing layers, and changing the parameters of layers.

To add a new layer to a Sequential model, use the model’s add() method, which takes the layer instance to append. To remove the last layer from a Sequential model, use the pop() method. To change the parameters of a layer, call set_weights() on the layer itself, passing a list of arrays whose shapes match the layer’s existing weights.
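A minimal sketch of those calls on a throwaway Sequential model:

import tensorflow as tf

# A small Sequential model to demonstrate the editing calls above.
model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(8,)),
  tf.keras.layers.Dense(16, activation="relu"),
  tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Remove the last layer, then append replacements with model.add().
model.pop()
model.add(tf.keras.layers.Dense(4, activation="relu"))
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# set_weights() takes a list of arrays matching the layer's kernel and bias.
first_dense = model.layers[0]
kernel, bias = first_dense.get_weights()
first_dense.set_weights([kernel * 0.5, bias])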

The TensorFlow Model Garden is a collection of pre-trained TensorFlow models that can be used as a starting point for creating new models. Alongside it, converters such as the TensorFlow Lite Converter can translate a model into a different format, such as the TensorFlow Lite format for mobile and embedded devices, and various model zoos provide additional pre-trained models to build on.

As noted above, the SavedModel files themselves are binary, so they are not meant to be edited in a text editor. If you want something hand-editable, export the architecture as JSON with model.to_json(), edit that, and rebuild the model; when doing so, it is important to be careful not to change the model’s architecture or the order of the layers. One small change (like an extraneous carriage return or comma) can cause the whole file to fail to parse. Think butterflies and tsunamis.
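A short sketch of that round trip; the file name is a placeholder, and note that the weights are saved and restored separately:

import tensorflow as tf

# A throwaway model to export; any Keras model works here.
model = tf.keras.Sequential([
  tf.keras.layers.Input(shape=(4,)),
  tf.keras.layers.Dense(1),
])

# Export the architecture as editable JSON; weights are not included.
with open("model_config.json", "w") as f:
  f.write(model.to_json())

# ...hand-edit model_config.json carefully...

# Rebuild the model from the edited JSON.
with open("model_config.json") as f:
  edited_model = tf.keras.models.model_from_json(f.read())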

Ultimately, the bigger the dataset, the more we can train a model. This is one of the reasons it’s great to have users help continue to classify materials. For example, if a sound clip is AI-generated but doesn’t get detected, having a user classify it will make subsequent tests more accurate. The hyperparameters help to supplement the accuracy of the algorithms. However, while my initial tests can get to about a 70% success rate, it might take a year of training to get to 90%, probably another year to get to 95%, and another for each percentage point after that, with more incremental gains above 99% and likely never reaching three nines (99.9%), given that the technology to create deepfakes will keep evolving.