@awacke1 on Hugging Face: "I just completed getting all four aspects of the new OpenAI GPT-4-o Omni model…"

awacke1

posted an update May 14, 2024

Post

2485

I just completed getting all four aspects of the new OpenAI GPT-4-o Omni model to process Text, Image, Audio, and Video.

Check it out and let me know what you think!

Space: awacke1/GPT-4o-omni-text-audio-image-video

Discussion: awacke1/GPT-4o-omni-text-audio-image-video

Test Runs for All Four Modalities: awacke1/GPT-4o-omni-text-audio-image-video#1

--Aaron - @awacke1

athareja

May 15, 2024

•

edited May 15, 2024

This looks great, thanks for sharing. Are you using audio capabilities of GPT-4o or first converting audio to text and using its text capabilities. I saw in their announcement that audio capabilities are not publicly available to everyone through their API, so wanted to see if I am misunderstanding something.

Developers can also now access GPT-4o in the API as a text and vision model. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.

awacke1

Jun 29, 2024

You can use whisper-1 for now and that pattern works great. The speech wav stream recorder is not in the code for openai yet. I use a streamlit recorder in order to get speech in which is working but I am looking for a better speech in/out technique. The audio to text is used as well and is how the video modality inputs its transcript for additive data input with the image slices from video. One thing I also did not see yet was the image generator inside the client api. That would be nice to add as well and also the speech synthesis.

taher30

May 18, 2024

Does this model use your API key? Is this billed or is this using a free model?

awacke1

Jun 29, 2024

It uses my openai key and org id and is hard to run in an open fashion due to usage. It uses the billed model.

Join the conversation