Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
awacke1 
posted an update May 14

This looks great, thanks for sharing. Are you using audio capabilities of GPT-4o or first converting audio to text and using its text capabilities. I saw in their announcement that audio capabilities are not publicly available to everyone through their API, so wanted to see if I am misunderstanding something.

Developers can also now access GPT-4o in the API as a text and vision model. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.

·

You can use whisper-1 for now and that pattern works great. The speech wav stream recorder is not in the code for openai yet. I use a streamlit recorder in order to get speech in which is working but I am looking for a better speech in/out technique. The audio to text is used as well and is how the video modality inputs its transcript for additive data input with the image slices from video. One thing I also did not see yet was the image generator inside the client api. That would be nice to add as well and also the speech synthesis.

Does this model use your API key? Is this billed or is this using a free model?

·

It uses my openai key and org id and is hard to run in an open fashion due to usage. It uses the billed model.