Decoding GPT-4'o': In-Depth Exploration of Its Mechanisms and Creating Similar AI.

Community Article Published May 21, 2024

OpenAI has launched GPT-4'o', a groundbreaking AI that behaves like a mixture of many models. In this blog post, we will discuss how GPT-4'o' works and how to create this kind of model.

0. GPT 4'o' Capabilities

Video Chat (a feature introduced for the first time).
Faster, human-like Voice Chat (it even shows emotions and changes tone).
Text Generation, Image Generation, Image QnA, Document QnA, Video QnA, Sequential Image Generation, and Image to 3D. Best of all, these are all packed into 1 model.
Supports 50+ languages.

See Examples in OpenAI Post

1. How GPT 4'o' works.

GPT 4o's working is mainly divided into 3 parts.

1. SuperChat

GPT 4 had already achieved sequential image generation and image QnA, so OpenAI only had to add document QnA, video QnA, and 3D generation. For a tech giant like OpenAI, that is a piece of cake. It is possible with the methods we discuss at the end.

2. Voice Chat

OpenAI has integrated TTS (Text-to-Speech) and STT (Speech-to-Text) into a single module, removing the text generation component they previously used. This means that when you speak, the AI analyzes your tone and words and creates an audio response in real time, similar to how streaming is used in text generation. In my opinion, OpenAI made this model comparatively less powerful because it is primarily designed for human interaction, and the AI is trained accordingly.

3. Video Chat

Video chat is not actually a live video interaction. The AI captures an image at the start of the conversation and takes additional images as needed or instructed. It then employs zero-shot image classification to respond to user queries. This module uses a more powerful model than voice chat because the AI can address a wider range of requests when it has visual information. For example, it can identify people and places, solve complex mathematical problems, detect coding errors, and much more than simple voice chat allows.
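To make the zero-shot image classification step concrete, here is a minimal sketch. It assumes the usual CLIP-style recipe: the captured frame and each candidate label are turned into embedding vectors, and the label whose vector is most similar to the frame wins. The tiny 3-number embeddings and the label texts below are made up for illustration. In a real system they would come from an image-text model, and nothing here is confirmed to be how GPT 4o does it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_embedding, label_embeddings):
    """Return the best matching label and the similarity score of every label."""
    scores = {
        label: cosine(image_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy embeddings (hypothetical values, not produced by any real model).
label_embeddings = {
    "a whiteboard with math": [0.9, 0.1, 0.0],
    "a screenshot of code": [0.1, 0.9, 0.1],
    "a person": [0.0, 0.2, 0.9],
}
frame_embedding = [0.2, 0.8, 0.1]

best_label, scores = zero_shot_classify(frame_embedding, label_embeddings)
```

No label-specific training is involved: adding a new category only means adding one more entry to `label_embeddings`.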

Figure: What people think about how OpenGPT-4 works ("What you think") vs. reality ("How it actually works").

2. Creating AI Like GPT 4o

Like OpenAI, we will also make 3 models. Before that, it's important to understand the two methods that can be used to create each of them.

1. MultiModalification or Mixture of Modal Method

This method combines 2 or more models according to their functionality to create a new, powerful, multifunctional model. It also requires further training.

2. Duct Tape Method

In this method, you just use different types of models or APIs for different tasks, without any training.

Making of SuperChat Model

MultiModalification or Mixture of Modal Method

To create the SuperChat model, we need to combine Text Generation, Image Generation, Image Classification, Document Classification, and Video Classification models. Use the same process used in Idefics 2, a model that combines zero-shot image classification with text generation: it can chat with you and answer questions based on images.
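As a rough illustration of what "combining" a vision model with a text generation model can mean, here is a minimal sketch. It assumes a common design for vision-language models: image features from the vision encoder are passed through a small trainable projection so they have the same size as the language model's token embeddings, and are then placed in front of the text tokens. The function names, sizes, and numbers are hypothetical, and this is a simplification, not the exact Idefics 2 architecture. The projection weights are the part that needs the further training mentioned above.

```python
def project(features, weight):
    """Linear projection: map each image feature (length d_img) to length d_txt.

    `weight` is a d_img x d_txt matrix stored as a list of rows.
    """
    return [
        [sum(f * w for f, w in zip(vector, column)) for column in zip(*weight)]
        for vector in features
    ]


def build_multimodal_sequence(image_features, token_embeddings, weight):
    """Prepend projected image 'tokens' to the text token embeddings."""
    image_tokens = project(image_features, weight)
    return image_tokens + token_embeddings


# Toy example: one image feature of size 2, text embeddings of size 3.
image_features = [[1.0, 2.0]]
weight = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
]
token_embeddings = [[0.5, 0.5, 0.5], [0.1, 0.2, 0.3]]

sequence = build_multimodal_sequence(image_features, token_embeddings, weight)
```

The language model then reads `sequence` like any other input, which is why the image and the question can be answered together.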

Duct Tape Method

Method without API: It includes one base model that is prompted to identify the type of task, sends the user's prompt to the model for that task, and then returns the output to the user. Optional: use a text generation model at the end to add some words and make the answer more realistic.

Method with API: One base model is prompted to call an API for specific types of query. This method is used by Copilot. For instance, when it is asked to create images, compose songs, conduct web searches, or answer questions from images, it uses an API for that task.
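The Duct Tape Method is essentially a router, so it can be sketched in a few lines. In this minimal sketch, `classify_task` is a keyword-based stand-in for the prompted base model (an assumption made so the example runs without any model), and the handlers are placeholder functions where real models or API calls would go. All names here are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def classify_task(prompt: str) -> str:
    """Stand-in for the prompted base model that names the task type."""
    text = prompt.lower()
    if any(cue in text for cue in ("draw", "image of", "picture of")):
        return "image_generation"
    if "3d" in text:
        return "3d_generation"
    return "text_generation"


def make_router(handlers: Dict[str, Handler],
                classifier: Callable[[str], str] = classify_task) -> Handler:
    """Build a function that sends each prompt to the handler for its task."""
    def route(prompt: str) -> str:
        task = classifier(prompt)
        # Fall back to plain text generation for unknown task types.
        handler = handlers.get(task, handlers["text_generation"])
        return handler(prompt)
    return route


# Placeholder handlers; each would wrap a real model or API.
handlers = {
    "text_generation": lambda p: f"[text] {p}",
    "image_generation": lambda p: f"[image] {p}",
    "3d_generation": lambda p: f"[3d] {p}",
}

route = make_router(handlers)
reply = route("Draw a cat wearing a hat")
```

Swapping a handler for an API call turns the "without API" variant into the "with API" variant. The routing logic stays the same.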

Recommended models from which you can create a SuperChat model as powerful as GPT 4o

Base Model - Llama 3 70B
Image Generation - Pixart Sigma or RealVisXL
Zero Shot Image Classification - SigLIP
Zero Shot Video Classification - X-CLIP
Sequential Image Gen - Control SDXL
Zero Shot Doc Classification - idf
3D Gen - InstantMesh
Other Models - AnimateDiff-Lightning

Making of VoiceChat Model

MultiModalification or Mixture of Modal Method

To develop a human-like speaking AI that also exhibits emotions, high-quality training data is essential. Additionally, an emotion identification model is needed to recognize the user's emotions, along with a text generation model that understands those emotions.

Duct Tape Method

It includes one STT model that transcribes the user's prompt, with the detected emotion attached, for a text generation model that encodes emotion in its answer. A TTS model such as Parler TTS Expresso can then further infuse emotion into the output.
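The voice chat chain described above can be sketched as four pluggable steps. This is a minimal sketch under assumptions: the `VoicePipeline` class, its field names, and the way the emotion is written into the prompt are all hypothetical, and the stub functions only stand in for real STT, emotion recognition, text generation, and TTS models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    transcribe: Callable[[bytes], str]        # speech to text
    detect_emotion: Callable[[bytes], str]    # speech emotion recognition
    generate: Callable[[str], str]            # text generation
    synthesize: Callable[[str, str], bytes]   # text to speech, given a style

    def respond(self, audio: bytes) -> bytes:
        text = self.transcribe(audio)
        emotion = self.detect_emotion(audio)
        # Pass the detected emotion to the text model as part of the prompt.
        prompt = f"[user emotion: {emotion}] {text}"
        reply = self.generate(prompt)
        # Reuse the emotion as a speaking-style hint for the TTS step.
        return self.synthesize(reply, emotion)


# Stub components so the sketch runs without any model.
pipeline = VoicePipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    detect_emotion=lambda audio: "happy",
    generate=lambda prompt: f"reply to {prompt}",
    synthesize=lambda text, style: f"<{style}> {text}".encode("utf-8"),
)

output_audio = pipeline.respond(b"good morning")
```

Because each step is just a function, any of the suggested models can be dropped in without touching the others.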

Suggested Models

Speech to Text - Whisper
Chat Model - Llama 3 8B
Text to Speech - Parler TTS Expresso
Emotion Identifier - Speech Emotion Recognition

Making of VideoChat Model

As previously mentioned, video chat only captures images. Thus, a zero-shot image classification model is necessary, while the rest remains the same as the voice chat model. However, it also requires a more intelligent model, because vision widens the range of use cases.
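Since the video chat model only works on captured images, the main extra decision is when to grab a new frame. Here is a minimal sketch of one possible policy: capture at the start, capture when the user's words point at something visual, and refresh a frame that has become stale. The cue words, the `max_age` value, and the staleness rule itself are assumptions made for illustration, not something confirmed about GPT 4o.

```python
# Hypothetical cue words suggesting the user is referring to something visible.
VISUAL_CUES = ("look", "see", "this", "show", "what is")


def needs_new_frame(utterance: str, has_frame: bool,
                    seconds_since_frame: float, max_age: float = 10.0) -> bool:
    """Decide whether to capture a fresh image before answering."""
    if not has_frame:
        # Start of the conversation: always capture once.
        return True
    if seconds_since_frame > max_age:
        # Assumed rule: the last image is too old to trust.
        return True
    text = utterance.lower()
    return any(cue in text for cue in VISUAL_CUES)


first_turn = needs_new_frame("hello", has_frame=False, seconds_since_frame=0.0)
small_talk = needs_new_frame("tell me a joke", has_frame=True, seconds_since_frame=1.0)
visual_question = needs_new_frame("can you see my code?", has_frame=True, seconds_since_frame=1.0)
```

When the function returns True, the new frame goes to the image model and its result is added to the voice chat prompt. Otherwise the voice chat model answers alone.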

Suggested Models

Zero Shot Image Classification - SigLIP
Speech to Text - Whisper
Chat Model - Llama 3 8B
Text to Speech - Parler TTS Expresso
Optional - Speech Emotion Recognition

Alternatively

Image QnA Model - Idefics 2
VoiceChat Model

Making of Similar AI

Covered in Next Blog: https://huggingface.co/blog/KingNish/opengpt-4o-working
