OpenAI, the artificial intelligence company that unleashed ChatGPT on the world last November, is making the chatbot app a lot more chatty.
An upgrade to the ChatGPT mobile apps for iOS and Android announced today lets a person speak their queries to the chatbot and hear it respond in its own synthesized voice. The new version of ChatGPT also adds visual smarts: Upload or snap a photo from within ChatGPT, and the app will respond with a description of the image and offer more context, similar to Google’s Lens feature.
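OpenAI hasn’t detailed how the app handles images internally, but a rough sketch of an equivalent request through the company’s public API might look like the following. The model name here (gpt-4-vision-preview) is taken from OpenAI’s API documentation and is an assumption for illustration, not necessarily what the app itself uses:

```python
# Hypothetical sketch: asking a vision-capable OpenAI model to describe a
# photo via the public chat API. The ChatGPT app's internal pipeline is not
# public; the model name and message format follow OpenAI's documented API.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local photo as a base64 data URL, as the API expects.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # assumption: a vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image and add context."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```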
ChatGPT’s new capabilities show that OpenAI is treating its artificial intelligence models, which have been in the works for years now, as products with regular, iterative updates. The company’s surprise hit is looking more and more like a consumer app that competes with Apple’s Siri or Amazon’s Alexa.
Making the ChatGPT app more enticing could help OpenAI in its race against other AI companies, like Google, Anthropic, Inflection AI, and Midjourney, by providing a richer feed of data from users to help train its powerful AI engines. Feeding audio and visual data into the machine learning models behind ChatGPT may also help OpenAI’s long-term vision of creating more human-like intelligence.
OpenAI’s language models that power its chatbot, including the most recent, GPT-4, were created using vast amounts of text collected from various sources around the web. Many AI experts believe that, just as animal and human intelligence makes use of various types of sensory data, creating more advanced AI may require feeding algorithms audio and visual information as well as text.
Google’s next major AI model, Gemini, is widely rumored to be “multimodal,” meaning it will be able to handle more than just text, perhaps allowing video, images, and voice inputs. “From a model performance standpoint, intuitively we would expect multimodal models to outperform models trained on a single modality,” says Trevor Darrell, a professor at UC Berkeley and a cofounder of Prompt AI, a startup working on combining natural language with image generation and manipulation. “If we build a model using just language, no matter how powerful it is, it will only learn language.”
ChatGPT’s new voice generation technology, which OpenAI developed in-house, also opens new opportunities for the company to license it to others. Spotify, for example, says it now plans to use OpenAI’s speech synthesis algorithms to pilot a feature that translates podcasts into additional languages in an AI-generated imitation of the original podcaster’s voice.
The new version of the ChatGPT app has a headphones icon in the upper right and photo and camera icons in an expanding menu in the lower left. The voice and visual features work by converting the input to text, using speech or image recognition, so the chatbot can generate a response. The app then replies in either voice or text, depending on which mode the user is in. When a reporter asked the new ChatGPT by voice whether it could “hear” her, the app responded, “I can’t hear you, but I can read and respond to your text messages,” because the voice query is actually processed as text. The app speaks in one of five voices, wholesomely named Juniper, Ember, Sky, Cove, and Breeze.
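A minimal sketch of that round-trip, using OpenAI’s documented Python SDK, might look like the following. The model and voice names (“whisper-1”, “gpt-4”, “tts-1”, “nova”) come from the public API and are assumptions for illustration; the app’s own pipeline is not public:

```python
# Hypothetical sketch of the voice round-trip the article describes:
# speech is transcribed to text, the text model generates a reply, and the
# reply is synthesized back into audio. All model and voice names here are
# from OpenAI's public API, not necessarily what the ChatGPT app uses.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech recognition: convert the spoken query to text.
with open("query.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1", file=audio_file
    )

# 2. The chatbot only ever sees text, never audio.
chat = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
reply = chat.choices[0].message.content

# 3. Speech synthesis: read the text reply aloud.
speech = client.audio.speech.create(model="tts-1", voice="nova", input=reply)
speech.stream_to_file("reply.mp3")
```

Because step 2 operates purely on text, the model has no notion of having “heard” anything, which matches the app’s answer above.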