ChatGPT's Evolution: Now It Can "See, Hear, and Speak"

In a ground-breaking move, OpenAI has unveiled a major update to its ChatGPT platform, giving it the ability to analyze images, hold fully spoken conversations, and respond in context to visual and audio input.

A Multimodal Leap Forward

On Monday, OpenAI announced that ChatGPT, powered by the GPT-3.5 and GPT-4 AI models, will now be able to analyze and respond to images as part of a text-based conversation. This is not a minor tweak; it’s a significant step towards making AI interactions more intuitive and closer to the way people actually communicate.

What’s New?

Image Recognition: Users can now upload one or more images for a conversation with ChatGPT. Whether it’s figuring out dinner options from pictures of your fridge or troubleshooting a malfunctioning grill, the AI can assist. OpenAI even showcased a promotional video where ChatGPT helps a user adjust a bike seat using uploaded photos.
Voice Features: The ChatGPT mobile app is set to introduce speech synthesis. Combined with the app's existing speech recognition, this paves the way for fully verbal interactions with the AI. Users can soon expect back-and-forth spoken conversations with ChatGPT, driven by a new text-to-speech model. OpenAI has created several synthetic voices, including Juniper, Sky, Cove, Ember, and Breeze, in collaboration with professional voice actors.

Rollout Plans

OpenAI has planned a phased rollout of these features. ChatGPT Plus and Enterprise subscribers can expect access within the next two weeks. Notably, speech synthesis will be exclusive to the iOS and Android apps, while image recognition will be available on both web and mobile platforms.

Under the Hood

While OpenAI hasn’t divulged the technical details, multimodal AI models like the one powering ChatGPT typically transform text and images into a shared encoding space. This allows them to process different data types through a single neural network. There is speculation that OpenAI might be leveraging its CLIP model to bridge the gap between visual and text data, enabling ChatGPT to make contextual deductions across both mediums.
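To make the idea of a shared encoding space a little more concrete, here is a toy Python sketch. It is not OpenAI's implementation and it does not use CLIP. The three-dimensional vectors, the captions, and the function names are all hypothetical stand-ins for what trained text and image encoders would produce. The only point it illustrates is that once text and images live in the same vector space, matching them comes down to a similarity comparison.

```python
import math

# Hypothetical, hand-written vectors standing in for the output of a trained
# text encoder. A real model would learn these, and they would have hundreds
# of dimensions rather than three.
TEXT_EMBEDDINGS = {
    "a photo of a bicycle": [0.9, 0.1, 0.0],
    "a photo of a grill": [0.1, 0.9, 0.1],
    "a photo of a fridge": [0.0, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_caption(image_embedding, text_embeddings):
    """Pick the caption whose vector points most nearly the same way as the image's."""
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_embedding, text_embeddings[caption]),
    )


# Pretend this came from an image encoder that was shown a bike photo.
image_embedding = [0.8, 0.2, 0.1]
print(best_caption(image_embedding, TEXT_EMBEDDINGS))  # a photo of a bicycle
```

Because both kinds of input end up as vectors of the same shape, the comparison code never needs to know which one started life as text and which as pixels.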

In Conclusion

OpenAI’s latest update to ChatGPT is more than just a technological advancement; it’s a testament to the rapid strides AI is making in becoming a more intuitive and versatile tool for human interaction.

Key Takeaways:

ChatGPT can now analyze and respond to images.
The AI platform will soon support fully verbal conversations.
Features will be rolled out to ChatGPT Plus and enterprise subscribers in the coming weeks.
OpenAI continues to push the boundaries of what’s possible with AI, making it more user-friendly and context-aware.

About the author


Stacy Cook

Stacy earned a B.S. in Computer Science with coursework in cybersecurity. She has 7 years of experience covering cloud platforms, AI tooling, enterprise software, and developer ecosystems. She is known for changelog breakdowns and hands-on explainers that help readers adopt new tools safely. She has guest-judged university hackathons and mentors early-career reporters on technical sourcing. Stacy climbs indoor routes, enjoys indie games, and keeps a home lab for testing. She writes the daily tech brief, coordinates product deep dives, and maintains our glossary of technical terms.

